qurator-spk / mods4pandas

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames for data analysis
Apache License 2.0

Can't export page info due to OOM #45

Open mikegerber opened 3 months ago

mikegerber commented 3 months ago
Aug 02 07:00:54 b-pc30533 kernel: Out of memory: Killed process 4030869 (mods4pandas) total-vm:70463740kB, anon-rss:28581044kB, file-rss:1232kB, shmem-rss:4kB, UID:1000 pgtables:136832kB oom_score_adj:0

That's a whopping 28 GB of memory after reading just 22% of the data...

→ Need a more memory-efficient way to handle this.

mikegerber commented 3 months ago

Considering all options, it looks like iteratively building a SQLite database would be the best option (or at least worth trying).

mikegerber commented 2 months ago

I've done some experiments, and the temporary SQLite db seems to be the way to go.
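
For readers following along, here is a minimal sketch of what "iteratively building a temporary SQLite database" could look like. It is not the mods4pandas implementation: `iter_page_records`, `write_records`, the `page_info` table name and the column names are all hypothetical, and it assumes every record has the same keys as the first one (real METS/ALTO metadata probably does not). The point is only that rows are flushed to SQLite in small batches, so memory use is bounded by the batch size rather than by the whole dataset.

```python
import sqlite3
from typing import Dict, Iterable, Iterator


def iter_page_records() -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for the per-page parser: yields one dict per page."""
    for i in range(10):
        yield {"ppn": f"PPN{i // 5}", "page": i % 5, "width": 2000 + i}


def write_records(
    con: sqlite3.Connection,
    table: str,
    records: Iterable[Dict[str, object]],
    batch_size: int = 1000,
) -> int:
    """Insert records batch by batch; only one batch is held in memory."""
    columns = None
    insert_sql = ""
    batch = []
    total = 0
    for rec in records:
        if columns is None:
            # Assumption: the first record defines the full set of columns.
            columns = list(rec)
            col_sql = ", ".join(f'"{c}"' for c in columns)
            placeholders = ", ".join("?" for _ in columns)
            con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_sql})')
            insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
        batch.append(tuple(rec.get(c) for c in columns))
        if len(batch) >= batch_size:
            con.executemany(insert_sql, batch)
            con.commit()
            total += len(batch)
            batch.clear()
    if batch:
        # Flush the final, partially filled batch.
        con.executemany(insert_sql, batch)
        con.commit()
        total += len(batch)
    return total


# In-memory database for the demo; a temporary file would be used in practice.
con = sqlite3.connect(":memory:")
n = write_records(con, "page_info", iter_page_records(), batch_size=4)
count = con.execute('SELECT COUNT(*) FROM "page_info"').fetchone()[0]
```

From there, the table could be read back in pieces, for example with `pandas.read_sql_query(..., chunksize=...)`, and written to the final output format chunk by chunk instead of as one large DataFrame.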