spedas / pyspedas

Python-based Space Physics Environment Data Analysis Software
https://pyspedas.readthedocs.io/
MIT License

Unnecessary os.walk in download.py makes it extremely slow #768

Open xnchu opened 6 months ago

xnchu commented 6 months ago

The issue is at line 422 in pyspedas/utilities/download.py: `for dirpath, dirnames, filenames in os.walk(local_path_to_search):`. This line loops through the full directory listing under local_path_to_search, which is usually a large data folder. It goes through each file in that directory and checks whether it matches the desired file name. This is a problem when I want to load 1000 days of local files: the walk is repeated for every requested file, which works out to 1000 x 1000 = 1,000,000 iterations through os.walk and takes forever. Proposed solution: instead of searching the whole directory for a file (e.g., c:/data/omni_20130101.cdf), use the absolute path of the file, which is already available as filename (line 333). If filename exists, use it; if not, there is no need to search the whole directory.
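A minimal sketch of that idea, for illustration only. The helper name `find_local_file` is hypothetical and is not part of pyspedas; it assumes `filename` is a plain absolute path with no wildcards, which the actual download() code may not guarantee.

```python
import os


def find_local_file(filename):
    """Return [filename] if that exact path exists, otherwise an empty list.

    Hypothetical helper, not pyspedas code. It does a single existence
    check instead of walking the whole data directory for every file.
    """
    # One stat call per requested file: cost does not grow with the
    # number of files stored under the data directory.
    if os.path.isfile(filename):
        return [filename]
    # Not found at the expected location: report no match rather than
    # falling back to a full os.walk of the data directory.
    return []
```

With 1000 requested files this performs 1000 existence checks, independent of how many files the data folder holds.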

Thanks a lot.

jameswilburlewis commented 6 months ago

I think the situation is a bit complicated, because the list of URLs passed to download() may contain wildcard or regex expressions, and may also map to multiple local directory paths. So it's not quite as simple as using the filename value from line 333. But maybe there are optimizations we can make to reduce the amount of filesystem traversal -- we'll take a look. Thanks for letting us know!
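One possible way to limit traversal while still supporting wildcards, sketched under assumptions: the names `has_wildcard` and `resolve_local` are hypothetical, only shell-style wildcards (`*`, `?`, `[...]`) are handled, and regex patterns or URLs that map to several local directories are not covered. It is not how download() currently works.

```python
import fnmatch
import glob
import os

# Characters that mark a shell-style wildcard pattern (assumption: no regex).
WILDCARD_CHARS = set("*?[")


def has_wildcard(pattern):
    """Return True if the pattern contains a shell-style wildcard character."""
    return any(ch in WILDCARD_CHARS for ch in pattern)


def resolve_local(pattern):
    """Return a sorted list of local files matching a path or wildcard pattern."""
    # Exact path: a single existence check, no directory traversal at all.
    if not has_wildcard(pattern):
        return [pattern] if os.path.isfile(pattern) else []

    directory, name_pattern = os.path.split(pattern)

    # Wildcards in the directory part: let glob expand the whole pattern.
    if has_wildcard(directory):
        return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))

    # Wildcards only in the file name: list one directory, not the whole tree.
    try:
        entries = os.listdir(directory or ".")
    except OSError:
        return []
    matches = fnmatch.filter(entries, name_pattern)
    paths = (os.path.join(directory, name) for name in matches)
    return sorted(p for p in paths if os.path.isfile(p))
```

The design choice here is to pay for a directory listing only when the pattern actually needs one, and even then to list a single directory instead of walking the full data tree.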

Beforerr commented 6 months ago

I tried to improve this before, but gave up later because of the wildcard parsing and because I did not quite understand the logic of the download function. It would be nice to decompose the function a little to make it easier for developers to improve specific parts.