openclimatefix / nwp-consumer

Microservice for consuming NWP data.

Modify dask to partition within init times as opposed to between them #135

Closed devsjc closed 3 months ago

devsjc commented 4 months ago

The initial spec of the consumer treated downloading multiple days of data as a first-class use case. These datasets would contain many init times, each with its own set of files. To make that process as quick as possible, I built the consumer to parallelise with dask across the desired init times, so that each init time was processed in parallel.

However, real-world usage of the consumer has indicated that we very rarely, if ever, download and convert multiple init times at once. Instead, we download one init time at a time and iterate through them. In addition, sources such as ICON have a great many files per init time, and as it stands the downloads of those files gain no benefit from parallel computing.

As such, I propose a refactor of the consumer's business logic: make DownloadSingleInitTime and ConvertSingleInitTime the service's new first-class citizens (business use cases). This enables the service to use dask to parallelise within each init time, speeding up the consumer in the way it is most regularly used.

Before (verticality indicates parallel):

IT1 ---> download file 1 ---> download file 2 ---> convert file 1 ---> convert file 2 ---> zarr
IT2 ---> download file 1 ---> download file 2 ---> convert file 1 ---> convert file 2 ---> zarr
IT3 ---> download file 1 ---> download file 2 ---> convert file 1 ---> convert file 2 ---> zarr

After:

        / download file 1 ---> convert file 1 \                  / download file 1 --->
IT1 --->  download file 2 ---> convert file 2 ---> zarr -> IT2 ->  download file 2 ---> ...
        \ download file 3 ---> convert file 3 /                  \ download file 3 ---> 

IT2 is shown here for clarity but, as mentioned, the most common use case for the consumer is a single init time, which is why the second option is preferred.
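
For illustration, the "After" layout could be sketched with dask.delayed roughly as follows. This is a minimal sketch and not the consumer's actual code: download_file, convert_file, write_zarr and process_single_init_time are hypothetical placeholders that only build strings, standing in for the real download, conversion and zarr-writing steps.

```python
import datetime as dt

import dask


def download_file(init_time: dt.datetime, name: str) -> str:
    """Hypothetical placeholder: pretend to download one file, return its local path."""
    return f"/tmp/{init_time:%Y%m%dT%H%M}/{name}"


def convert_file(path: str) -> str:
    """Hypothetical placeholder: pretend to convert one downloaded file."""
    return path + ".converted"


def write_zarr(converted: list[str]) -> list[str]:
    """Hypothetical placeholder: pretend to merge the converted files into one zarr store."""
    return sorted(converted)


def process_single_init_time(init_time: dt.datetime, file_names: list[str]) -> list[str]:
    # One download -> convert chain per file. The chains do not depend on
    # each other, so dask is free to run them in parallel.
    per_file = [
        dask.delayed(convert_file)(dask.delayed(download_file)(init_time, name))
        for name in file_names
    ]
    # The final step depends on every chain, so it runs once all files are done.
    result = dask.delayed(write_zarr)(per_file)
    return result.compute(scheduler="threads")


if __name__ == "__main__":
    it1 = dt.datetime(2024, 1, 1, 0, 0)
    print(process_single_init_time(it1, ["file1.grib", "file2.grib", "file3.grib"]))
```

Multiple init times would then be handled by calling process_single_init_time in a plain loop, matching the IT1 -> IT2 sequence in the diagram. The threaded scheduler is chosen here on the assumption that downloads are mostly I/O-bound.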

devsjc commented 4 months ago

Initial modification made in https://github.com/openclimatefix/nwp-consumer/pull/134. For ease of testing and migration, this does not change the business language of the consumer for the moment. I do think, however, that the business language of the NWPConsumer class (and the methods that reflect it) should pivot to the SingleInitTime variants, considering this seems to be the primary use case of the consumer.