Closed: edsu closed this issue 1 month ago
That sounds like it would be worth trying.
I put in a request to double the memory 🤞
We could also experiment with setting `low_memory=True` when calling `scan_csv`: https://docs.pola.rs/api/python/stable/reference/api/polars.scan_csv.html
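A minimal sketch of what that could look like, assuming the merge step reads the dimensions and openalex CSVs with `scan_csv` (the file names here are placeholders):

```python
import polars as pl

# low_memory=True trades some scan speed for a smaller memory footprint
# while the CSVs are being read (file names are hypothetical).
dimensions = pl.scan_csv("dimensions.csv", low_memory=True)
openalex = pl.scan_csv("openalex.csv", low_memory=True)
```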
We seem to have resolved this for now by lowering the amount of data we are writing to the merged publication parquet file (#90).
At some point we may want to revisit why we can't use `df.sink_parquet()` (the streaming sink) instead of `df.collect().write_parquet()`, since the former should stream the data to the parquet file rather than requiring the whole result to be built in memory.
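For comparison, a sketch of the two approaches, assuming the merge is expressed as a lazy query joining the two CSVs (the join key and file names are assumptions):

```python
import polars as pl

merged = pl.scan_csv("dimensions.csv").join(
    pl.scan_csv("openalex.csv"), on="doi"
)

# Current approach: materialize the whole merged result in memory,
# then write it out.
merged.collect().write_parquet("publications.parquet")

# Streaming alternative: polars writes batches to the parquet file
# as they are produced, so the full result never has to fit in RAM.
merged.sink_parquet("publications.parquet")
```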
I noticed that the `merge_publications` step works fine when using DEV_LIMIT=1000 on my laptop but fails in our production Airflow: https://sul-rialto-airflow-dev.stanford.edu/dags/harvest/grid?dag_run_id=manual__2024-07-10T11%3A06%3A13.060630%2B00%3A00&tab=logs&task_id=merge_publications
The error is:
After retrying a few times I noticed that the server is running out of available memory, which I believe results in the task being killed (either by Airflow, Docker or the VM).
The dimensions and openalex CSVs are quite large:
We currently only have 8GB of RAM available. Since we are using polars and streaming, it's not exactly clear to me how much RAM we might need. We could try doubling it and see if that helps?
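One rough way to estimate it, as a sketch, would be to collect a small limited sample locally and extrapolate from its in-memory size (the file name and row limit here are placeholders):

```python
import polars as pl

# Collect a small sample and measure its size in memory, then scale up
# to the expected row count to get a ballpark RAM figure.
sample = pl.scan_csv("dimensions.csv").head(1000).collect()
mb_per_row = sample.estimated_size("mb") / sample.height
print(f"~{mb_per_row:.4f} MB per row, so 1M rows is roughly {mb_per_row * 1_000_000:,.0f} MB")
```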