Improved efficiency of weather-mv bq in terms of time and cost.

mahrsee1997 commented 2 months ago

Stats:

Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; number of data variables: 10):

<xarray.Dataset>
Dimensions:                  (longitude: 1440, latitude: 721, level: 13, time: 1)
Coordinates:
  * longitude                (longitude) float32 0.0 0.25 0.5 ... 359.5 359.8
  * latitude                 (latitude) float32 -90.0 -89.75 ... 89.75 90.0
  * level                    (level) int32 50 100 150 200 ... 700 850 925 1000
  * time                     (time) timedelta64[ns] 06:00:00
    datetime                 (time) datetime64[ns] ...
Data variables:
    10m_u_component_of_wind  (time, latitude, longitude) float32 ...
    10m_v_component_of_wind  (time, latitude, longitude) float32 ...
    2m_temperature           (time, latitude, longitude) float32 ...
    mean_sea_level_pressure  (time, latitude, longitude) float32 ...
    geopotential             (time, level, latitude, longitude) float32 ...
    specific_humidity        (time, level, latitude, longitude) float32 ...
    temperature              (time, level, latitude, longitude) float32 ...
    u_component_of_wind      (time, level, latitude, longitude) float32 ...
    v_component_of_wind      (time, level, latitude, longitude) float32 ...
    vertical_velocity        (time, level, latitude, longitude) float32 ...

Total number of rows to be ingested into BQ: 13,497,120 (1440 × 721 × 13 × 1)
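
As a quick check of the row count above, here is a minimal sketch that multiplies the dimension sizes printed in the dataset summary. It assumes one row per coordinate combination, as the count implies; the `dims` dictionary is only an illustration, not code from the patch.

```python
import math

# Dimension sizes copied from the xarray.Dataset summary above.
dims = {"longitude": 1440, "latitude": 721, "level": 13, "time": 1}

# One BigQuery row per coordinate combination (assumption stated in the lead-in).
total_rows = math.prod(dims.values())
print(f"{total_rows:,}")  # prints 13,497,120
```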

Ran on DataflowRunner with machine_type: n1-standard-1; 4 GB RAM; 100 GB HDD.

Branch	Time taken	Cost	Max workers (autoscaled)
main	48 min	$9.33	625 workers
mv-opitimization	36 min	$0.10	6 workers

Note: Our development project can scale up to 1000 workers (with no resource restrictions). In a real-world scenario, users might not have this many workers, so the time and cost with the main branch would be significantly higher.

Approach:

We calculate latitude, longitude, geo_point, and geo_polygon information upfront and write it to a Parquet file, so that it is not recomputed every time we process a set of files.
Previously, we created indexes across all the index dimensions (e.g., lat, lon, time, level) and then selected rows from the dataset based on these coordinates. This resulted in a high number of I/O calls.
Now, we only create indexes across all the index dimensions except for latitude and longitude, thereby reducing the number of coordinates and, consequently, the number of I/O calls.
We use pandas DataFrame and its methods to generate rows instead of iterating over each row with a for loop.
Using --rows_chunk_size <chunk-size>, users can control the number of rows loaded into memory for processing, depending on their system's memory.
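
The steps above can be sketched roughly as follows. This is not the code from the patch: it swaps the xarray dataset for plain NumPy arrays to stay self-contained, omits geo_point, geo_polygon, and the Parquet write, and the names `build_grid` and `generate_rows` are hypothetical. Only the `rows_chunk_size` parameter mirrors the `--rows_chunk_size` flag mentioned above.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical toy dataset with dims (time, level, latitude, longitude).
coords = {
    "time": np.array(["2020-01-01T06:00"], dtype="datetime64[ns]"),
    "level": np.array([500, 850]),
    "latitude": np.array([-90.0, 0.0, 90.0]),
    "longitude": np.array([0.0, 90.0, 180.0, 270.0]),
}
rng = np.random.default_rng(0)
shape = tuple(len(v) for v in coords.values())
data_vars = {
    "temperature": rng.random(shape, dtype=np.float32),
    "geopotential": rng.random(shape, dtype=np.float32),
}


def build_grid(lat, lon):
    """Compute the per-cell (lat, lon) columns once; reusable across files."""
    lat2d, lon2d = np.meshgrid(lat, lon, indexing="ij")
    return pd.DataFrame({"latitude": lat2d.ravel(), "longitude": lon2d.ravel()})


def generate_rows(coords, data_vars, grid, rows_chunk_size):
    """Yield chunks of rows (lists of dicts), one lat x lon slab at a time."""
    other_dims = [d for d in coords if d not in ("latitude", "longitude")]
    # Index only the non-spatial dimensions: len(time) * len(level) selections
    # instead of one selection per (time, level, lat, lon) point.
    for idx in itertools.product(*(range(len(coords[d])) for d in other_dims)):
        frame = grid.copy()
        for dim, i in zip(other_dims, idx):
            frame[dim] = coords[dim][i]
        for name, values in data_vars.items():
            # values[idx] is a (lat, lon) slab; ravel() matches the grid order.
            frame[name] = values[idx].ravel()
        # Hand out at most rows_chunk_size rows at a time.
        for start in range(0, len(frame), rows_chunk_size):
            yield frame.iloc[start:start + rows_chunk_size].to_dict("records")


grid = build_grid(coords["latitude"], coords["longitude"])
chunks = list(generate_rows(coords, data_vars, grid, rows_chunk_size=5))
total_rows = sum(len(c) for c in chunks)
```

In this toy case the loop makes two slab selections (one per level) rather than 24 point selections, and every yielded chunk holds at most five rows.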

Assumption: Enough memory is available to load, at once, all the data variables over the full (lat × lon) grid for a single index value of the remaining dimensions (everything apart from lat and lon). I think this assumption is reasonable because, for a 0.1° resolution dataset (3600 × 1800) with 51 data variables, only 9 GiB of RAM is required.
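
For a rough feel of how this requirement scales, the sketch below counts only the raw array bytes of one lat × lon slab across all data variables, assuming float32 values. It is a lower bound and not a reproduction of the 9 GiB figure, since it ignores DataFrame overhead and any derived columns; `raw_slab_bytes` is a hypothetical helper.

```python
def raw_slab_bytes(n_lat, n_lon, n_vars, bytes_per_value=4):
    """Raw bytes for all data variables over one lat x lon slab (float32 by default)."""
    return n_lat * n_lon * n_vars * bytes_per_value


tested = raw_slab_bytes(721, 1440, 10)   # the dataset used for the stats above
fine = raw_slab_bytes(1800, 3600, 51)    # the 0.1 resolution example
print(f"{tested / 2**20:.1f} MiB, {fine / 2**30:.2f} GiB")
```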

P.S.: This draws on what we learned from the ARCO-ERA5 to BQ ingestion.

alxmrs commented 2 months ago

Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.

mahrsee1997 commented 2 months ago

Thanks @alxmrs and @fredzyda for the review!

google / weather-tools

Improved efficiency of weather-mv bq in terms of time and cost. #473
