odissei-lifecourse / life-sequencing-dutch

MIT License

speed up processing of tabular files #25

Open f-hafner opened 5 months ago

f-hafner commented 5 months ago

this concerns the scripts for extracting summary stats, and possibly the conversion of tables to datasets for LLM training.

for the summary stats, we're currently using the python engine to read in the files, because not all csv files use the same column separator. using the c engine would require us to know and specify the separator for each file.
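One possible way around this is to guess the separator from a small sample of each file, so that it can be passed explicitly to the faster parser. The sketch below is only an illustration and uses the standard library's `csv.Sniffer`. The function name `detect_separator`, the candidate separators, the sample size and the UTF-8 encoding are all assumptions, not something the current scripts do.

```python
import csv


def detect_separator(path, candidates=",;\t|", sample_chars=64 * 1024):
    """Guess the column separator of a delimited text file (hypothetical helper).

    Reads whole lines until roughly `sample_chars` characters have been seen,
    then lets csv.Sniffer pick one of the `candidates`. Raises csv.Error if
    no consistent separator can be determined.
    """
    # Assumption: files are UTF-8 encoded; adjust if that does not hold.
    with open(path, newline="", encoding="utf-8") as f:
        sample = "".join(f.readlines(sample_chars))
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter
```

Assuming the files are read with `pandas.read_csv` (which the "python engine" wording suggests), the result could be used as `pd.read_csv(path, sep=detect_separator(path), engine="c")`. Sniffing is a heuristic, so it would be worth checking the guesses against a few known files before relying on it.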

moreover, we should consider tools that exploit multiple cores when handling dataframes. dask.dataframe may be a good option, but we should explore the alternatives a bit more.
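As a baseline to compare against before committing to a library, the per-file work could also be spread over cores with the standard library alone. The sketch below is not dask. It uses `concurrent.futures.ProcessPoolExecutor` instead, with one task per file. The names `summarise_file` and `summarise_many`, the default `sep=","`, the UTF-8 encoding and the choice of "summary" (column names and row count) are all hypothetical and stand in for whatever the real scripts compute.

```python
import csv
from concurrent.futures import ProcessPoolExecutor


def summarise_file(path, sep=","):
    """Return the column names and data-row count of one delimited file."""
    # Assumption: UTF-8 files whose first row is a header.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=sep)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return {"path": path, "columns": header, "n_rows": n_rows}


def summarise_many(paths, executor=None):
    """Summarise several files concurrently, preserving input order.

    If no executor is given, a process pool with one worker per core
    (the ProcessPoolExecutor default) is created for the call.
    """
    if executor is not None:
        return list(executor.map(summarise_file, paths))
    with ProcessPoolExecutor() as pool:
        return list(pool.map(summarise_file, paths))
```

When the default process pool is used from a script, the call should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. This only parallelises across files. Splitting a single large file across cores is the part where something like dask.dataframe would come in.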

tanzir5 commented 5 months ago

It's a good idea to explore Dask. I plan to look into making things faster once we have the fake data. Without trial and error, it's difficult to estimate what actually makes things faster and what breaks things.