Closed santiagohermo closed 1 year ago
Here is a list of suggestions to improve the `SaveData` function that I came up with after running into some limitations, @jmshapir. If you want to implement any of this I'm happy to (slowly) help. If not, we can just close this issue. Thanks!
@santiagohermo thanks! These are helpful. I think for now we can close the issue, but we can reopen and implement if we hit another use case.
I have been working with large data and found some limitations in `SaveData` that prompted me to use a local version with a few different functionalities. I think you may be interested in adding some of them to the function. Find below a list of suggestions.
Support `feather` and `fst` file formats to save data
Issue: `csv` files are saved quickly with `data.table`, but they use a lot of space on disk. The `feather` (from arrow) and `fst` (from fst) formats are also really fast, sometimes even faster, and they also compress the files, potentially saving a lot of space on disk.
Proposal: Add these formats to the data dictionary.
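A minimal sketch of how the save step could dispatch on the file extension. `save_by_extension` is a hypothetical helper, not part of `SaveData`, and the sketch assumes the `data.table`, `arrow`, and `fst` packages are installed:

```r
# Hypothetical helper (not part of SaveData): pick the writer from the
# file extension. Assumes data.table, arrow and fst are installed.
save_by_extension <- function(df, path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv     = data.table::fwrite(df, path),
    feather = arrow::write_feather(df, path),
    fst     = fst::write_fst(df, path),
    stop("Unsupported file format: ", ext)
  )
  invisible(path)
}
```

For example, `save_by_extension(df, "output/data.fst")` would write the compressed `fst` file, while an unknown extension stops with an error instead of silently writing nothing.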
Shortcomings of log file generation
Speed
Issue: While I haven't tested formally, I think the way the package computes summary statistics can be slow for large datasets.
Proposal: Change the computation of sumstats here. For that, make `df` a `data.table` object, and change the code to something like this:
```R
library(data.table)

dt <- as.data.table(df)  # df is the data frame being saved

# Names of the numeric columns
numeric_vars <- names(dt)[vapply(dt, is.numeric, logical(1))]

# One row per statistic, then transpose so each variable is a row
numeric_sum <- t(rbind(
  dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, sd,   na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, min,  na.rm = TRUE), .SDcols = numeric_vars],
  dt[, lapply(.SD, max,  na.rm = TRUE), .SDcols = numeric_vars]
))
```
Optionally exclude vars
Issue: Sometimes there is a variable for which you don't want to include the summary statistics (because data are confidential, for example).
Proposal: Add a `mask_vars` optional argument that takes a list of variables and excludes them from the numeric sumstats computation.
Data with no numeric variables
Issue: In this case the output has a lot of `NULL`s where the sumstats should go.
Proposal: Do not compute sumstats, and exclude them from the log file, if all variables in the dataset are non-numeric.
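The `mask_vars` proposal and the no-numeric-variables proposal could be handled in the same place. The sketch below is only an illustration: `numeric_summary` is a hypothetical helper, and it uses base R for clarity rather than the `data.table` approach suggested under Speed.

```r
# Hypothetical helper (not part of SaveData): summary statistics for
# numeric columns, skipping any column listed in mask_vars.
# Returns NULL when no numeric column is left, so the caller can leave
# the sumstats out of the log file.
numeric_summary <- function(df, mask_vars = character(0)) {
  is_num <- vapply(df, is.numeric, logical(1))
  numeric_vars <- setdiff(names(df)[is_num], mask_vars)
  if (length(numeric_vars) == 0) {
    return(NULL)
  }
  stats <- list(mean = mean, sd = sd, min = min, max = max)
  # One column per statistic, one row per variable
  do.call(cbind, lapply(stats, function(f) {
    vapply(numeric_vars,
           function(v) as.numeric(f(df[[v]], na.rm = TRUE)),
           numeric(1))
  }))
}
```

Calling `numeric_summary(df, mask_vars = "secret")` would leave the hypothetical column `secret` out of the table, and a dataset with only character columns returns `NULL` instead of a block of empty entries.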
Data loaded from STATA
Issue: Sometimes when you load data the class of the column includes a label and other attributes. This distorts the `type` column of the log file.
Proposal: Clean the extra column attributes before generating the log file (or maybe even before saving).
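One possible way to do the cleaning, sketched in base R. `strip_stata_attrs` is a hypothetical helper, and the attribute and class names (`label`, `labels`, `format.stata`, `haven_labelled`) are an assumption about data read with `haven::read_dta`; other import routes may attach different attributes.

```r
# Hypothetical helper (not part of SaveData): drop Stata-specific
# attributes so each column reports its plain type.
# Assumes the attributes/classes attached by haven::read_dta.
strip_stata_attrs <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (inherits(x, "haven_labelled")) {
      x <- unclass(x)  # drop the labelled class, keep the underlying values
    }
    attr(x, "label") <- NULL         # variable label
    attr(x, "labels") <- NULL        # value labels
    attr(x, "format.stata") <- NULL  # Stata display format
    df[[col]] <- x
  }
  df
}
```

If `haven` is already a dependency, its `zap_labels()`, `zap_label()`, and `zap_formats()` functions cover the same ground.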