plantphys / gsti

A project focused on the development of generalized spectra-trait models for the prediction of leaf photosynthetic capacity. This includes models focused on the prediction of leaf nitrogen, leaf mass per area (LMA), leaf water content (LWC), Vcmax, Jmax and dark respiration.
GNU General Public License v3.0

Replace Rdata by csv? #45

JulienLamour closed 1 year ago

JulienLamour commented 1 year ago

Hey Shawn @serbinsh,

As you know, I started out using the Rdata format for the intermediate outputs of the pipeline for each dataset. Before we start curating everything into a hopefully final and stable state, maybe we should consider using .csv instead. What do you think?

serbinsh commented 1 year ago

@JulienLamour I am somewhat indifferent, but I do think any final prepared data should/could be in .csv format. The intermediate files could stay as Rdata. Rdata files are nice because they are fast to load and can hold multi-dimensional objects.

Long term, sqlite is probably a better way to store all prepared data, but there is no need to do that yet.

serbinsh commented 1 year ago

I've also messed around with fst files in the past: https://github.com/fstpackage/fst/

It's a nice package, and it definitely shrinks files and speeds up dataframe operations.

regnans commented 1 year ago

I support having csv as the final data product format. As far as I know it's the most universal.

JulienLamour commented 1 year ago

Ok, I'll add a "curated dataset" folder where we maintain the finished overall database in the form of several csv files (the same as in the data curation pdf that I showed you). I'll keep the Rdata files as intermediate files.

serbinsh commented 1 year ago

@regnans I agree, very universal. I think we don't need to do that for the intermediate files, since it would increase the number of files: instead of embedding multiple dataframes and data types in one file, we would need a CSV for each. But the final gasex dataset and final spectra dataset, the files users will use, should be csv, even though csv is more "expensive" (size, speed) than other formats (e.g. fst).

regnans commented 1 year ago

For sure. Only final products need be csv, no need to "export" intermediate products.
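
The trade-off discussed in this thread (one container file holding several tables for intermediate or long-term storage, versus one CSV per table for the final products) can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the project's R pipeline; the table names, column names, values, and the `curated_dataset` folder name are hypothetical placeholders.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical example tables; column names and values are placeholders,
# not data from the project.
TABLES = {
    "gasex": (["sample_id", "vcmax"], [("s1", 50.0), ("s2", 62.5)]),
    "spectra": (
        ["sample_id", "wavelength_nm", "reflectance"],
        [("s1", 500, 0.05), ("s1", 800, 0.45)],
    ),
}


def write_sqlite(db_path, tables):
    """Store every table in a single SQLite file (one container, many tables)."""
    con = sqlite3.connect(db_path)
    try:
        for name, (columns, rows) in tables.items():
            col_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            con.execute(f"CREATE TABLE {name} ({col_list})")
            con.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)
        con.commit()
    finally:
        con.close()


def export_csv(db_path, out_dir):
    """Write one CSV file per table and return the paths, sorted by table name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    paths = []
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
            )
        ]
        for name in names:
            cur = con.execute(f"SELECT * FROM {name}")
            header = [d[0] for d in cur.description]
            path = out_dir / f"{name}.csv"
            with open(path, "w", newline="") as fh:
                writer = csv.writer(fh)
                writer.writerow(header)
                writer.writerows(cur)
            paths.append(path)
    finally:
        con.close()
    return paths


workdir = Path(tempfile.mkdtemp())
db_file = workdir / "prepared.sqlite"
write_sqlite(db_file, TABLES)
csv_files = export_csv(db_file, workdir / "curated_dataset")
```

Here the single `prepared.sqlite` file plays the role of the multi-table container, and the export step produces one CSV per table, which is why exporting intermediates as CSV would multiply the number of files.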