google / weather-tools

Tools to make weather data accessible and useful.
https://weather-tools.readthedocs.io/
Apache License 2.0

Improved efficiency of weather-mv bq in terms of time and cost. #473

Closed by mahrsee1997 2 months ago

mahrsee1997 commented 2 months ago

Stats:

Tested on a single-file dataset (longitude: 1440; latitude: 721; level: 13; time: 1; 10 data variables):

<xarray.Dataset>
Dimensions:                  (longitude: 1440, latitude: 721, level: 13, time: 1)
Coordinates:
  * longitude                (longitude) float32 0.0 0.25 0.5 ... 359.5 359.8
  * latitude                 (latitude) float32 -90.0 -89.75 ... 89.75 90.0
  * level                    (level) int32 50 100 150 200 ... 700 850 925 1000
  * time                     (time) timedelta64[ns] 06:00:00
    datetime                 (time) datetime64[ns] ...
Data variables:
    10m_u_component_of_wind  (time, latitude, longitude) float32 ...
    10m_v_component_of_wind  (time, latitude, longitude) float32 ...
    2m_temperature           (time, latitude, longitude) float32 ...
    mean_sea_level_pressure  (time, latitude, longitude) float32 ...
    geopotential             (time, level, latitude, longitude) float32 ...
    specific_humidity        (time, level, latitude, longitude) float32 ...
    temperature              (time, level, latitude, longitude) float32 ...
    u_component_of_wind      (time, level, latitude, longitude) float32 ...
    v_component_of_wind      (time, level, latitude, longitude) float32 ...
    vertical_velocity        (time, level, latitude, longitude) float32 ...

Total number of rows to be ingested into BQ: 13,497,120 (1440 × 721 × 13 × 1)
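The row count above is the product of the dimension sizes, which suggests one BigQuery row per (longitude, latitude, level, time) coordinate. A quick check in Python, with the sizes copied from the dataset summary (the `dims` dictionary is only illustrative):

```python
from math import prod

# Dimension sizes copied from the xarray summary above.
dims = {"longitude": 1440, "latitude": 721, "level": 13, "time": 1}

# One row per coordinate combination, so the total is the product of the sizes.
total_rows = prod(dims.values())
print(f"{total_rows:,}")  # 13,497,120
```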

Ran on DataflowRunner. machine_type: n1-standard-1; 4 GB RAM; 100 GB HDD.

Branch           | Time taken | Cost  | Autoscaled up to
main             | 48 min     | $9.33 | 625 workers
mv-opitimization | 36 min     | $0.10 | 6 workers

Note: Our development project has 1000 workers available (with no resource restrictions). In a real-world scenario, users might not have this many workers, so the time and cost on the main branch would be significantly higher.

Approach:

Assumption: Enough memory is available to load, at once, all the data variables over the full (lat × lon) grid for a single index of each remaining dimension (those other than lat & lon). I think this is a reasonable assumption because, for a 0.1° resolution dataset (3600 × 1800) with 51 data variables, only 9 GiB of RAM is required.
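A minimal sketch of what this assumption implies for the unit of work. It is not the actual weather-mv code. If each task loads the full lat × lon grid for one index of the remaining dimensions, the test dataset above splits into one task per (level, time) pair. The helper name `iter_slice_indices` and the dictionary layout are hypothetical, and the sketch ignores that some variables have no `level` dimension.

```python
from itertools import product

def iter_slice_indices(dims, grid_dims=("latitude", "longitude")):
    """Yield one index dict per combination of the non-grid dimensions.

    Hypothetical helper: each yielded dict identifies a full lat x lon slice
    that would be loaded into memory at once.
    """
    other = [name for name in dims if name not in grid_dims]
    for combo in product(*(range(dims[name]) for name in other)):
        yield dict(zip(other, combo))

# Dimension sizes of the test dataset from the Stats section.
dims = {"longitude": 1440, "latitude": 721, "level": 13, "time": 1}

slices = list(iter_slice_indices(dims))
rows_per_slice = dims["latitude"] * dims["longitude"]

print(len(slices))      # 13 slices: one per (level, time) index
print(rows_per_slice)   # 1038240 rows produced by each slice
print(slices[0])        # {'level': 0, 'time': 0}
```

Under this reading, the number of tasks grows with the non-grid dimensions only, while the memory needed per task is bounded by the grid size and the number of data variables.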

P.S. This approach comes from what we learned during the ARCO-ERA5 to BQ ingestion.

alxmrs commented 2 months ago

Since I’m not employed at Google right now, I can’t in good conscience give this an approval. I think @fredzyda would be a better person to decide if this should be merged. I will say, this patch looks good to me.

mahrsee1997 commented 2 months ago

Thanks @alxmrs and @fredzyda for the review!