mahrsee1997 closed this 2 months ago
Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.
Thanks @alxmrs and @fredzyda for the review!
Stats:
Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; data variables: 10):
Total number of rows to be ingested into BQ: 13,497,120 (1440 × 721 × 13 × 1).
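As a quick sanity check, the expected row count is simply the product of the dataset's index dimension sizes (a hypothetical one-liner for illustration, not part of the patch):

```python
# Sanity check: BQ gets one row per coordinate combination, so the
# expected row count is the product of the index dimension sizes.
import math

dims = {'longitude': 1440, 'latitude': 721, 'level': 13, 'time': 1}
assert math.prod(dims.values()) == 13_497_120
```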
Ran on DataflowRunner. machine_type: n1-standard-1; 4 GB RAM; 100 GB HDD.
Note: In our development project we ran with 1,000 workers (no resource restrictions). In a real-world scenario, users might not have this many workers, so the time and cost on the main branch would have been significantly higher.
Approach:
By passing `--rows_chunk_size <chunk-size>`, users can control the number of rows loaded into memory for processing, depending on their system's memory (see the sketch below).

Assumption: enough memory is available to load all the data variables for a full (lat × lon) slice, i.e. a single index along each remaining dimension, at once. I think we can make this assumption because, for a 0.1° resolution dataset (3600 × 1800) with 51 data variables, only 9 GiB of RAM is required.
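For illustration, here is a minimal sketch of the chunking idea under those assumptions; the helper name and details are hypothetical, not the actual weather-mv implementation:

```python
# Hypothetical sketch of chunked row extraction; illustrative only,
# not the actual weather-tools implementation.
import numpy as np
import xarray as xr

def iter_row_chunks(ds: xr.Dataset, rows_chunk_size: int):
    """Yield pandas DataFrames of at most `rows_chunk_size` rows each."""
    # Everything except the spatial dims is iterated one index at a time.
    outer_dims = [d for d in ds.sizes if d not in ('latitude', 'longitude')]
    for idx in np.ndindex(*(ds.sizes[d] for d in outer_dims)):
        # Load a single (lat x lon) slice of all data variables into memory.
        slab = ds.isel(dict(zip(outer_dims, idx))).load()
        df = slab.to_dataframe().reset_index()
        # Emit rows in bounded-size chunks instead of all at once.
        for start in range(0, len(df), rows_chunk_size):
            yield df.iloc[start:start + rows_chunk_size]
```

Peak memory then stays roughly at one (lat × lon) slice of all data variables plus one chunk of rows, which matches the assumption above.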
P.S.: This draws on learnings from the ARCO-ERA5 to BQ ingestion.