Closed by JulienLamour 1 year ago
@JulienLamour I am somewhat indifferent, but I do think any final prepared data should, or at least could, be in .csv format. The intermediate files could stay as Rdata. Rdata files are nice because they are fast to load and can hold multi-dimensional objects.
Long term, SQLite is probably a better way to store all prepared data, but there is no need to do that yet.
I've also messed around with fst files in the past: https://github.com/fstpackage/fst/
It's a nice package and definitely shrinks and speeds up dataframe operations.
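To make the SQLite idea above concrete, here is a minimal sketch of keeping several prepared tables in one SQLite file and exporting one of them to csv as a final product. It is written in Python with only the standard library for illustration and is not the project's R pipeline; the table names (`gasex`, `spectra`), the column names, and the values are hypothetical placeholders, not real data.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def store_tables(db_path, tables):
    """Write each table (name -> (columns, rows)) into a single SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        for name, (columns, rows) in tables.items():
            col_sql = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in columns)
            conn.execute(f'DROP TABLE IF EXISTS "{name}"')
            conn.execute(f'CREATE TABLE "{name}" ({col_sql})')
            conn.executemany(
                f'INSERT INTO "{name}" VALUES ({placeholders})', rows
            )
        conn.commit()
    finally:
        conn.close()


def export_table_to_csv(db_path, table, csv_path):
    """Dump one table from the SQLite file to a csv file with a header row."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY rowid')
        header = [d[0] for d in cursor.description]
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cursor)
    finally:
        conn.close()


# Hypothetical example tables; names, columns, and values are made up.
example_tables = {
    "gasex": (["sample_id", "A", "gs"],
              [("leaf_01", 12.5, 0.21), ("leaf_02", 9.8, 0.15)]),
    "spectra": (["sample_id", "wavelength", "reflectance"],
                [("leaf_01", 500, 0.08), ("leaf_01", 501, 0.09)]),
}

with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "curated.sqlite"
    csv_file = Path(tmp) / "gasex.csv"
    store_tables(db_file, example_tables)       # one file, several tables
    export_table_to_csv(db_file, "gasex", csv_file)  # csv only for the final product
    csv_text = csv_file.read_text()

print(csv_text)
```

The point of the sketch is only the layout: one database file holds all the tables, and a csv is written only for the tables that users will actually download.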
I support having csv as the final data product format. As far as I know it's the most universal.
OK, I'll add a "curated dataset" folder where we maintain the finished overall database as several csv files (the same as in the data curation pdf that I showed you). I'll keep the Rdata files as intermediate files.
@regnans I agree, very universal. I don't think we need csv for the intermediate files, since that would increase the number of files: instead of embedding multiple dataframes and data types in one file, we would have a CSV for each. But the final gasex dataset and final spectra dataset, the files users will actually use, should be csv, even though csv is more "expensive" (size, speed) than other formats (e.g. fst).
For sure. Only final products need be csv, no need to "export" intermediate products.
Hey Shawn @serbinsh,
As you know, to start with I used the Rdata format for the intermediate outputs of the pipeline for each dataset. Before we start curating everything into a hopefully final and stable state, maybe we should consider using .csv instead. What do you think?