openclimatefix / Satip

Satip contains the code necessary for retrieving, transforming and storing EUMETSAT data
https://satip.readthedocs.io/
MIT License
41 stars 29 forks source link

speed up download_tailored_datasets by adding option to parallel downloads #250

Closed JasonFengGit closed 3 months ago

JasonFengGit commented 3 months ago

Pull Request

Description

Trying to speed up download_tailored_datasets by adding an option to parallel downloads using concurrent.futures

Fixes https://github.com/openclimatefix/Satip/issues/144

With parallelization: image

How Has This Been Tested?

Checklist:

JasonFengGit commented 3 months ago

Hi @jacobbieker, I've fixed the lint errors. However, I'm not sure if we should merge this into production before conducting more tests. I'm concerned about the potential memory overhead that might occur when running multiple downloads simultaneously.

I'll conduct some additional tests to see how it performs. Could you think of any scenarios where this could go wrong?

jacobbieker commented 3 months ago

The data tailor only takes 3 jobs at a time, so potentially limiting it to a concurrency of 3 would be great, and keeps memory lower. Making it configurable is probably the best way forward for it.

JasonFengGit commented 3 months ago

I see! Thanks! I'll implement that.

JasonFengGit commented 3 months ago

Hi @jacobbieker, I've added a concurrency param to limit the parallel downloads.

One concern I have is that a new test I added for parallel downloads takes about 8 minutes on my machine. This increases the total test time (which is already lengthy) from approximately 30 minutes to about 40 minutes. I am not sure if I should replace the original test_data_tailor, though. Thanks!

JasonFengGit commented 3 months ago

Hi @jacobbieker, I updated the code (remove the non-parallel approach and test, concurrency defaulted to 1) and tested on my machine. Thx!

JasonFengGit commented 3 months ago

Hi @jacobbieker, speaking of testing, there is another reason for the long testing time: https://github.com/openclimatefix/Satip/issues/157#issuecomment-2023223214. Could you provide some suggestions for the implementation here? Thanks!

jacobbieker commented 3 months ago

Hi, yeah, that sleep can probably be removed now, there can still be rate limiting, which that one is supposed to fix, but it seems to happen less often now. I would still leave the other sleeps that are around it which still would help mitigate hitting EUMETSAT's servers all at the same time

JasonFengGit commented 3 months ago

I see! Could you review this pr or approve this https://github.com/openclimatefix/Satip/pull/247 so I can make a seperate PR? Thanks!