ropensci / arkdb

Archive and unarchive databases as flat text files
https://docs.ropensci.org/arkdb

Adding a callback function for data adjustment #47

Closed 1beb closed 3 years ago

1beb commented 3 years ago

There is often a need to prepare, reorder, sort, or recode raw data into a usable format. For example, databases often store categorical information in a binary (0/1) or integer (1, 2, 3, 4, 5) format, while researchers and others using these data often expect to receive them with the encoding built in. Adding a callback to the ark function would allow this kind of "analytical preparation" to be completed on a just-in-time basis, simplifying some typical database <> in-memory analytical workflows. Combined with the new filter_statement, this could be used to set up common data pipelines where a window of data is required, perhaps summarized, modeled, or otherwise aggregated.

cboettig commented 3 years ago

Nice suggestion, I like this a lot. We see this all over the place in the fishbase data, for instance, where many boolean variables are still typed as integers in the SQL dumps.

What would the interface for this look like, though? I do wonder whether the efficiency of the just-in-time approach is worth the additional complexity. Arguably it might be simpler for most users to handle these cases by mutating database tables directly?

1beb commented 3 years ago

Take a look at the implementation in #45 and the example usage in https://github.com/ropensci/arkdb/pull/45/files#diff-6eb88ccdea6455edc8341a00ed31f6c8b3ff6c51856dbb832edffc8b7df9cde3R48. I think the implementation is really straightforward and the function definition quite simple. One point that may be forgotten here is that R users may not also be SQL users. This lets them keep transformations in R, even ones that arguably could be easier in SQL.

You would write a function that accepts data as its argument:

please_callback <- function(data) {
   data$somevar <- amazing_transformation(data$somevar)
   # ... more amazing transformations
   data
}

Then supply that to ark:

ark(db, ".", callback = please_callback)
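
As a slightly more concrete sketch of such a callback, the example below recodes a 0/1 column to logical and an integer-coded column to labelled categories. The column names `is_active` and `rating`, the labels, and the name `recode_callback` are all made up for illustration; the only thing taken from the thread is that the callback receives the data and returns the transformed data.

```r
# Hypothetical callback: the columns `is_active` and `rating` are assumptions
# for illustration. The function takes a data frame and returns a data frame.
recode_callback <- function(data) {
  # 0/1 integers become FALSE/TRUE
  data$is_active <- as.logical(data$is_active)
  # integer codes 1..5 become labelled categories
  data$rating <- factor(
    data$rating,
    levels = 1:5,
    labels = c("very low", "low", "medium", "high", "very high")
  )
  data
}

# Try the callback on a small in-memory data frame before handing it to ark()
chunk <- data.frame(
  id = 1:3,
  is_active = c(1L, 0L, 1L),
  rating = c(5L, 2L, 3L)
)
str(recode_callback(chunk))

# Then, with an open database connection `db`:
# ark(db, ".", callback = recode_callback)
```

Testing the function on a small data frame first makes it easier to catch mistakes before running a long export.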

The caveat here is that you would likely want to work on only one table at a time, perhaps with different callbacks. I think for bigger-than-toy examples, now including the filter_statement, this creates a Swiss Army knife for downloading data from a database and getting it locally in the format an analyst needs for real work to begin. That is especially true with the addition of parquet, which has been getting closer to the de facto standard for larger-than-RAM analytics.
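
One way to handle the one-table-at-a-time pattern is to keep a named list of callbacks and call ark once per table. This is only a sketch: the table names (`species`, `surveys`), their columns, and the helper `export_tables` are hypothetical, and it assumes that ark's `tables` argument can be used to restrict an export to a single named table.

```r
# Hypothetical per-table callbacks, keyed by (made-up) table name.
callbacks <- list(
  species = function(data) {
    # 0/1 flag stored as integer becomes logical
    data$is_marine <- as.logical(data$is_marine)
    data
  },
  surveys = function(data) {
    # integer season codes become labelled categories
    data$season <- factor(data$season, levels = 1:2, labels = c("dry", "wet"))
    data
  }
)

# Hypothetical helper: export each table with its own callback.
# `db` is assumed to be an open DBI connection; `tables` is assumed to
# limit the export to the named table.
export_tables <- function(db, dir, callbacks) {
  for (tbl in names(callbacks)) {
    arkdb::ark(db, dir, tables = tbl, callback = callbacks[[tbl]])
  }
  invisible(names(callbacks))
}

# Usage, given an open connection `db`:
# export_tables(db, ".", callbacks)
```

Keeping the callbacks in a list keyed by table name means each transformation can still be tested on its own against a small data frame.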

cboettig commented 3 years ago

Ah, very nice, the user provides an arbitrary function. Yes, definitely.

Good point about generalizing to multiple tables, but like you say, it should be intuitive to use the same interface to work on one table at a time.

1beb commented 3 years ago

Closed by #45