sokolhessnerlab / itrackvalr

Toolkit with a built-in pipeline infrastructure to integrate and analyze eye-tracking, behavioral, and self-report data in R.
MIT License

Cache large data objects while working within notebooks #3

Closed · aridyckovsky closed this issue 3 years ago

aridyckovsky commented 3 years ago

Is your feature request related to a problem? Please describe. With eye-tracker sampling data reaching up to 2 GB in extracted CSV format, operations that require loading the data in chunks can be lengthy and sometimes crash the session.

Describe the solution you'd like Load the data from CSV once and save it as a temporary .RData or .Rds file that can easily be loaded back into a variable during processes like document rendering.

Describe alternatives you've considered None.

Additional context Potentially useful: https://rdrr.io/cran/xfun/man/cache_rds.html. We must also determine whether temporarily caching data in a local environment raises any participant data confidentiality issues. @psokolhessner could use your feedback here.
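A minimal sketch of how xfun::cache_rds() could wrap the expensive CSV load; the file path and object name here are hypothetical, not part of the pipeline yet:

```r
library(xfun)
library(readr)

# The slow 2 GB read runs once; subsequent calls reload the cached .rds instead.
samples <- xfun::cache_rds({
  readr::read_csv("data/raw/eyetracker_samples.csv")  # hypothetical path
}, file = "eyetracker_samples.rds", dir = "cache/")
```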

aridyckovsky commented 3 years ago

Any other thoughts on this data-caching alternative to avoid re-loading the 2 GB dataset of samples each time knitr is run? There is always the non-evaluation option: we tell knitr to skip evaluation until we're ready to share a more official notebook rendering that includes tables and figures. This is less ideal because it makes our analysis discourse less streamlined.
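For concreteness, a hedged sketch of what the non-evaluation fallback (and knitr's own chunk cache) could look like; the setup-chunk placement and chunk label are assumptions about how we'd wire it up:

```r
# Globally skip evaluation while drafting, then flip to TRUE for the official render.
knitr::opts_chunk$set(eval = FALSE)

# Per-chunk alternative: use knitr's built-in cache so the heavy load chunk
# re-runs only when its code changes, e.g. a chunk header like
#   {r load-samples, cache=TRUE}
```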

psokolhessner commented 3 years ago

Thanks for the nudge @aridyckovsky. There are no data confidentiality issues here since the data is anonymized to begin with, so no concerns there. In the past, my main approach with data of this type (neuroimaging data has similar issues; loading is nontrivial) has been to plan on doing it the painful way once, save out what I need, and thereafter only access the necessary files (not the originals). So would a solution that goes file by file and extracts the main x, y, and time data, for example, be sufficient here, or is the issue deeper than that?
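A rough sketch of that save-once, reload-thereafter workflow, assuming hypothetical paths and that the cleaned frames keep only time, x, and y columns:

```r
library(readr)
library(dplyr)

raw_files <- list.files("data/raw", pattern = "\\.csv$", full.names = TRUE)

for (f in raw_files) {
  reduced <- readr::read_csv(f) |>
    dplyr::select(time, x, y)  # keep only what downstream analyses need
  out_file <- file.path(
    "data/derived",
    paste0(tools::file_path_sans_ext(basename(f)), ".rds")
  )
  saveRDS(reduced, out_file)   # later work reads this small .rds, not the 2 GB CSV
}

# Downstream: samples <- readRDS("data/derived/<participant_id>.rds")
```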

aridyckovsky commented 3 years ago

That's a great question. I haven't checked how much the total file size changes when using the cleaned data. My assumption is that this issue will persist unless the cleaning decreases the data size by a few orders of magnitude (a few MB of data instead of GB), but it's worth seeing whether the issue can be minimized just by working with relatively smaller data. Either way, this is probably an issue worth fixing pre-sharing, i.e., we should provide some functional mechanism that lets people cache loaded data if needed.

psokolhessner commented 3 years ago

Agreed. If we're dropping enough columns (and potentially separating the data, e.g., into the different temporal epochs of calibration, validation, task, and re-validation), then as long as it works at some point, that may be good enough for these purposes, because the needs here will also be affected by many idiosyncrasies, like the computers doing the analysis, the sampling rate of the eye tracker, etc.

aridyckovsky commented 3 years ago

Note that this may be addressed by #18 when it is complete.