gslab-econ / gslab_r


Requests for SaveData #27

Closed santiagohermo closed 1 year ago

santiagohermo commented 1 year ago

I have been working with large data and ran into some limitations of SaveData that prompted me to use a local version with a few additional features. You may be interested in adding some of them to the function. Below is a list of suggestions.

Support feather and fst file formats to save data

Issue: csv files are saved quickly with data.table, but they use a lot of disk space. The feather format (from arrow) and the fst format (from fst) are also very fast, sometimes even faster, and they compress the files, potentially saving a lot of disk space.

Proposal: Add these formats to the data dictionary.
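
As a rough illustration of the proposal, the sketch below picks a writer from the file extension. The helper name `write_by_format` and the lookup list are hypothetical and are not part of SaveData; the writer calls (`data.table::fwrite`, `arrow::write_feather`, `fst::write_fst`) assume those packages are installed.

```R
# Hypothetical helper (not part of SaveData): choose a writer by file extension.
write_by_format <- function(df, path) {
  ext <- tolower(tools::file_ext(path))

  # Each writer is only evaluated when called, so a package is needed
  # only for the format that is actually requested.
  writers <- list(
    csv     = function(x, p) data.table::fwrite(x, p),
    feather = function(x, p) arrow::write_feather(x, p),
    fst     = function(x, p) fst::write_fst(x, p)
  )

  if (!ext %in% names(writers)) {
    stop("Unsupported file extension: '", ext, "'")
  }

  writers[[ext]](df, path)
  invisible(path)
}

# Example call (requires the fst package):
# write_by_format(mtcars, "mtcars.fst")
```

Keeping the formats in a single named list means a new format is one extra entry, which is close in spirit to adding it to a format dictionary.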

Shortcomings of log file generation

Speed

Issue: While I haven't tested this formally, I think the way the package computes summary statistics can be slow for large datasets.

Proposal: Change the computation of sumstats here. To do that, make df a data.table object and change the code to something like this:

```R
numeric_vars <- names(dplyr::select_if(dt, is.numeric))
numeric_sum <- t(rbind(
  dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, sd,   na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, min,  na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, max,  na.rm = TRUE), .SDcols = numeric_vars]
))
```

Optionally exclude vars

Issue: Sometimes there is a variable for which you don't want to include the summary statistics (because data are confidential, for example).

Proposal: Add a mask_vars optional argument that takes a list of variables and excludes them from the numeric sumstats computation.
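
A minimal sketch of how such an argument could behave, written in base R to keep it dependency-free. The function name `numeric_summary` is hypothetical; only the `mask_vars` argument name comes from the proposal, and this is not the package's actual implementation.

```R
# Hypothetical helper: summary statistics for numeric columns,
# skipping any column listed in `mask_vars`.
numeric_summary <- function(df, mask_vars = character(0)) {
  df <- as.data.frame(df)  # also accepts a data.table or tibble

  is_num <- vapply(df, is.numeric, logical(1))
  vars <- setdiff(names(df)[is_num], mask_vars)

  stats <- vapply(
    df[vars],
    function(x) {
      c(mean = mean(x, na.rm = TRUE),
        sd   = sd(x, na.rm = TRUE),
        min  = min(x, na.rm = TRUE),
        max  = max(x, na.rm = TRUE))
    },
    c(mean = 0, sd = 0, min = 0, max = 0)  # template: four named numbers
  )

  t(stats)  # one row per variable, one column per statistic
}

# Example call: `secret` is numeric but is left out of the result.
# numeric_summary(data.frame(wage = c(10, 20, 30), secret = 1:3),
#                 mask_vars = "secret")
```

Masked variables would still be saved and could still appear in the list of column names and types; only their summary statistics are withheld.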

Data with no numeric variables

Issue: In this case the output has a lot of NULLs where the sumstats should go.

Proposal: If all variables in the dataset are non-numeric, do not compute sumstats and leave them out of the log file.
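
One way to express that check, as a sketch only: the helper `build_log_lines` and its output format are hypothetical and much simpler than the real log file. The point is the early return when no column is numeric.

```R
# Hypothetical helper: build log lines, adding summary statistics only
# when at least one column is numeric.
build_log_lines <- function(df) {
  df <- as.data.frame(df)

  # One line per column with its (first) class.
  types <- vapply(df, function(x) class(x)[1], character(1))
  lines <- sprintf("%s (%s)", names(df), types)

  is_num <- vapply(df, is.numeric, logical(1))
  if (!any(is_num)) {
    return(lines)  # nothing numeric: leave the sumstats section out entirely
  }

  means <- vapply(df[is_num], mean, numeric(1), na.rm = TRUE)
  c(lines, sprintf("mean(%s) = %s", names(means), format(means)))
}
```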

Data loaded from Stata

Issue: Sometimes when you load data from Stata, a column's class includes a label and other attributes. This distorts the type column of the log file.

Proposal: Clean the extra column attributes before generating the log file (or maybe even before saving).
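
A sketch of such a cleaning step in base R, assuming the data were read with haven, which stores Stata metadata in the `label`, `labels`, and `format.stata` attributes and the `haven_labelled` class. The helper name `strip_stata_attributes` is hypothetical. haven also provides `zap_label()`, `zap_labels()`, and `zap_formats()`, which may be a better fit if the package is already a dependency.

```R
# Hypothetical helper: drop Stata-specific metadata from every column so that
# class() reports the underlying type (numeric, character, ...).
strip_stata_attributes <- function(df) {
  strip_one <- function(x) {
    if (inherits(x, "haven_labelled")) {
      x <- unclass(x)  # drop the labelled class, keep the underlying values
    }
    for (a in c("label", "labels", "format.stata")) {
      attr(x, a) <- NULL
    }
    x
  }

  df[] <- lapply(df, strip_one)  # `df[] <-` keeps the data frame's own class
  df
}
```

Note that dropping value labels discards information, so it may be safer to do this only for the log file rather than before saving.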

santiagohermo commented 1 year ago

@jmshapir Here is a list of suggestions to improve the SaveData function that I came up with after running into some limitations. If you want to implement any of this, I'm happy to (slowly) help. If not, we can just close this issue. Thanks!

jmshapir commented 1 year ago

@santiagohermo thanks! These are helpful. I think for now we can close the issue, but we can reopen and implement if we hit another use case.