google / weather-tools

Tools to make weather data accessible and useful.
https://weather-tools.readthedocs.io/
Apache License 2.0

`weather-mv` will ingest data into BQ from Zarr much faster. #357

Closed alxmrs closed 1 year ago

alxmrs commented 1 year ago

weather-mv bq's previous Zarr ingestion system only used one worker. This PR uses Xarray-Beam for Zarr ingestion to distribute xr.Dataset chunks across Beam workers. This speeds up ingestion into BQ.
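To illustrate the idea of splitting a dataset into chunks that independent workers can process, here is a minimal, library-free sketch. It is not the code in this PR and does not use the Xarray-Beam API; the function names, dimension sizes, and chunk sizes are all hypothetical and chosen only for the example.

```python
from itertools import product


def chunk_offsets(dim_sizes, chunk_sizes):
    """Yield one {dim: (start, stop)} mapping per chunk of a gridded dataset.

    Dimensions missing from `chunk_sizes` are kept whole.
    """
    ranges = []
    for dim, size in dim_sizes.items():
        step = chunk_sizes.get(dim, size)
        ranges.append(
            [(start, min(start + step, size)) for start in range(0, size, step)]
        )
    for combo in product(*ranges):
        yield dict(zip(dim_sizes, combo))


def rows_in_chunk(bounds):
    """Count the rows a chunk would produce (one row per grid point)."""
    count = 1
    for start, stop in bounds.values():
        count *= stop - start
    return count


# Hypothetical dataset: 10 time steps on a 4 x 8 latitude/longitude grid,
# chunked along time only.
dims = {"time": 10, "latitude": 4, "longitude": 8}
chunks = {"time": 4}

pieces = list(chunk_offsets(dims, chunks))
total_rows = sum(rows_in_chunk(piece) for piece in pieces)
```

Each entry in `pieces` describes a slice that could be converted to rows without looking at any other slice, which is what allows the work to be spread over many workers instead of one.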

Outstanding issues: I can't find a way to incrementally load rows into BQ from Zarr. While I've used windowing on fixed intervals to break up a large ingestion job into smaller parts, it seems like the actual writing to BQ gets stuck in a reshuffle step within the WriteToBigQuery transform. In this PR or a future PR, let's try to find a way to incrementally write rows to BQ once they've been processed, instead of having to wait for the entire dataset to be processed. CC: @dabhicusp.

DarshanSP19 commented 1 year ago

Timings after running the pipelines for the cases mentioned:

| File size | `main` branch | `mv-ba-fix-zarr` (this branch) |
| --- | --- | --- |
| ~100 MB | 58 min | 41 min |
| ~1 GB | 16 h 30 min | 1 h 12 min |

These changes make Zarr batch ingestion considerably faster than before.
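The speedup implied by the timings above can be worked out with a few lines of arithmetic. This is only a sketch: the durations are copied from the reported results, while the helper name and dictionary keys are made up for the example.

```python
def to_minutes(hours=0, mins=0):
    """Convert an hours/minutes duration to minutes."""
    return hours * 60 + mins


# (main branch, this branch) durations in minutes, per file size.
timings = {
    "~100 MB": (to_minutes(mins=58), to_minutes(mins=41)),
    "~1 GB": (to_minutes(hours=16, mins=30), to_minutes(hours=1, mins=12)),
}

# Speedup = time on main / time on this branch.
speedups = {size: main / branch for size, (main, branch) in timings.items()}

for size, factor in speedups.items():
    print(f"{size}: {factor:.2f}x faster")
```

This gives roughly 1.4x for the smaller file and 13.75x for the larger one, so the gain grows with dataset size in these two runs.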