openclimatefix / ocf_datapipes

OCF's DataPipe based dataloader for training and inference
MIT License

Use ThreadPoolMapper to request all of a batch at once in each worker #177

Closed: jacobbieker closed this issue 9 months ago

jacobbieker commented 1 year ago

Even with smaller chunks, the satellite data is quite slow to access from GCP, and the same goes for NWP. Since loading is quite fast off a local disk, it is likely more of a latency issue. One possible way around that is to use ThreadPoolMapper to request multiple SpaceTimeLocations all at once in each worker.

Detailed Description

Docs are here

The datapipe would then be in charge of making the batch, so this might require some changes to how our training datapipes are currently constructed. We should also make sure that each worker builds a full batch before it's returned (I think they do, as during training the batches arrive all at once in bursts of num_workers batches); otherwise this might not make as much of a difference.
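A minimal sketch of the idea above, using the stdlib ThreadPoolExecutor as a stand-in for torchdata's ThreadPoolMapper (the names load_example and load_batch are hypothetical, not from the real pipeline): each worker fans out the I/O-bound reads for a whole batch across a thread pool, so the remote-access latency for the batch's examples overlaps instead of accruing serially.

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 16

def load_example(location):
    # Stand-in for the slow per-SpaceTimeLocation read (satellite/NWP slice
    # from remote storage); here it just builds a dummy example.
    return {"loc": location, "data": location * 2}

def load_batch(locations, max_workers=8):
    # Request every example in the batch at once; the worker only returns
    # once the full batch is assembled. pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_example, locations))

batch = load_batch(range(BATCH_SIZE))
```

With purely I/O-bound reads the GIL is released during the waits, so threads (rather than extra worker processes) are enough to hide the latency.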

jacobbieker commented 1 year ago

One option could be to put the threadpool map on the output of the select-spatial-slice step, with the mapped function being .load(): that way it should load the requested data into memory in parallel for multiple examples, without needing to change too much of the pipeline.
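A sketch of that placement, again with the stdlib thread pool standing in for ThreadPoolMapper (LazySlice and threaded_load are hypothetical illustrations, not real ocf_datapipes classes): the threaded map sits right after the spatial slice, and the mapped function just triggers the lazy load.

```python
from concurrent.futures import ThreadPoolExecutor

class LazySlice:
    """Stand-in for a lazily-sliced xarray object coming out of the
    spatial-slice step; .load() would trigger the remote read."""
    def __init__(self, value):
        self.value = value
        self.loaded = False

    def load(self):
        # In the real pipeline this pulls the selected data into memory.
        self.loaded = True
        return self

def threaded_load(slices, threads=8):
    # Equivalent in spirit to mapping fn=.load() with a thread pool over
    # the spatial-slice output, so several examples load concurrently.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda s: s.load(), slices))

loaded = threaded_load([LazySlice(i) for i in range(4)])
```

Because only the load call is wrapped, the rest of the datapipe chain is unchanged.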

jacobbieker commented 1 year ago

It turns out ThreadPoolMapper is in torchdata 0.7, which is not released yet, so I had to copy in the source. But it does seem to speed things up a bit on GCP. Loading with 1 worker, batch of 16, non-HRV + Topo + Sun:

- Without threading: ~300 seconds per iteration
- With 8 threads each for sat/NWP/HRV, placed right after the spatial and time slicing that loads the xarray into memory: ~180-200 seconds per iteration

jacobbieker commented 1 year ago

#181 was also related; it added more built-in functionality to use inside other datapipes.