Closed by alxmrs 1 year ago
Timings after running the pipelines for the cases mentioned:
| File size | main branch | mv-ba-fix-zarr (this branch) |
|---|---|---|
| ~100 MB | 58 min | 41 min |
| ~1 GB | 16 h 30 min | 1 h 12 min |
These changes make Zarr batch ingestion substantially faster than before: roughly 1.4x faster for the ~100 MB file and about 14x faster for the ~1 GB file.
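For reference, the speedup implied by the timings above can be checked with a few lines of Python. This is only arithmetic on the reported numbers; the helper names `to_minutes` and `speedup` are made up for this illustration.

```python
def to_minutes(hours=0, minutes=0):
    """Convert a duration given in hours and minutes to total minutes."""
    return hours * 60 + minutes


def speedup(before, after):
    """Return how many times faster `after` is compared to `before`."""
    return before / after


# (main branch, this branch) run times in minutes, taken from the timings above.
timings = {
    "~100 MB": (to_minutes(minutes=58), to_minutes(minutes=41)),
    "~1 GB": (to_minutes(hours=16, minutes=30), to_minutes(hours=1, minutes=12)),
}

for size, (main, branch) in timings.items():
    print(f"{size}: {speedup(main, branch):.2f}x faster")
```

The gap widens with file size, which is consistent with the single-worker bottleneck mattering more for larger datasets.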
`weather-mv bq`'s previous Zarr ingestion system used only one worker. This PR uses Xarray-Beam for Zarr ingestion in order to distribute `xr.Dataset` chunks across Beam workers. This improves ingestion into BQ.

Outstanding issues: I can't find a way to incrementally load rows into BQ from Zarr. While I've used windowing on fixed intervals to break up a large ingestion job into smaller parts, it seems like the actual writing to BQ gets stuck in a reshuffle step within the `WriteToBigQuery` transform. In this PR or a future PR, let's try to find a way to incrementally write rows to BQ once they've been processed, instead of having to wait for the entire dataset to be processed. CC: @dabhicusp.
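
To make the chunk-distribution idea concrete, here is a minimal plain-Python model of it. It does not use Xarray-Beam or Beam, and it is not the code in this PR: the grid, the dimension names, and the helpers `chunk_offsets` and `chunk_to_rows` are all hypothetical. It only shows why chunking helps: each chunk can be turned into rows without looking at any other chunk, so a runner is free to hand chunks to different workers.

```python
from itertools import product


def chunk_offsets(sizes, chunks):
    """Yield the starting offset of every chunk as a {dim: start} dict."""
    dims = list(sizes)
    ranges = [range(0, sizes[d], chunks[d]) for d in dims]
    for starts in product(*ranges):
        yield dict(zip(dims, starts))


def chunk_to_rows(data, offsets, chunks):
    """Flatten one chunk of a 2-D (time, lat) grid into a list of row dicts."""
    t0, y0 = offsets["time"], offsets["lat"]
    rows = []
    for t in range(t0, min(t0 + chunks["time"], len(data))):
        for y in range(y0, min(y0 + chunks["lat"], len(data[t]))):
            rows.append({"time": t, "lat": y, "value": data[t][y]})
    return rows


# Hypothetical 4 x 3 grid standing in for one variable of a dataset.
data = [[t * 10 + y for y in range(3)] for t in range(4)]
sizes = {"time": 4, "lat": 3}
chunks = {"time": 2, "lat": 3}

# Each chunk is processed independently of the others; here they run in a
# simple loop, but nothing prevents running them on separate workers.
all_rows = []
for offsets in chunk_offsets(sizes, chunks):
    all_rows.extend(chunk_to_rows(data, offsets, chunks))

print(len(all_rows), "rows from", len(list(chunk_offsets(sizes, chunks))), "chunks")
```

In this model the rows for a chunk are available as soon as that chunk is processed, which is the behavior the outstanding issue hopes to get from the BQ write step as well.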