purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License

path_dispatch: speedup dispatch map about 80% #72

Closed karlicoss closed 1 month ago

karlicoss commented 1 month ago

Building the dispatch map could take 5+ seconds if you have lots of files in your takeout (e.g. 50K+), which may be longer than actually parsing whatever you wanted! Most of these files aren't even handled at the moment (like mp3 files from Assistant or Google Drive), but they still slow down everything else.

After the optimizations it takes <1s on my system.

karlicoss commented 1 month ago

Kind of more elaborate than the other optimizations :sweat_smile: But it feels wrong for the TakeoutParser class to take longer to instantiate than the actual parsing!

I think ideally we'd not process these files at all (not even match them against the handler map), if we could somehow exclude the parent directory here: https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/locales/en.py#L68 E.g. we could have an entry r"My Activity/Assistant/": None, and that would mean "do not go inside the directory at all" -- but for that to work we'd need to .walk rather than .glob the paths anyway. However, one issue is that we'd need that for every locale, so this is kinda relevant to https://github.com/seanbreckenridge/google_takeout_parser/issues/69 :sweat_smile: -- if we had a single handler map, such excludes would be easier!
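To make the "walk rather than glob" idea concrete, here is a minimal sketch of directory pruning with `os.walk`. It is not the library's implementation: `HANDLER_MAP`, `walk_with_excludes`, and the example patterns other than `r"My Activity/Assistant/"` are hypothetical, and it assumes keys are regexes matched from the start of the POSIX-style path relative to the takeout root.

```python
import os
import re
from pathlib import Path
from typing import Callable, Dict, Iterator, Optional, Tuple

Handler = Callable[[Path], object]

# Hypothetical handler map (an assumption, not the real locale map):
# a value of None means "do not descend into this directory at all".
HANDLER_MAP: Dict[str, Optional[Handler]] = {
    r"My Activity/Assistant/": None,
    r"Chrome/BrowserHistory\.json": lambda p: ("chrome_history", p),
}


def walk_with_excludes(
    root: Path, handler_map: Dict[str, Optional[Handler]]
) -> Iterator[Tuple[Path, Handler]]:
    """Yield (file, handler) pairs, skipping excluded directories entirely."""
    excludes = [re.compile(k) for k, v in handler_map.items() if v is None]
    handlers = [(re.compile(k), v) for k, v in handler_map.items() if v is not None]

    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = Path(dirpath).relative_to(root)

        # Assigning to dirnames[:] prunes in place, so os.walk never
        # enters the excluded subtrees (this only works top-down, the default).
        dirnames[:] = [
            d
            for d in dirnames
            if not any(rx.match((rel_dir / d).as_posix() + "/") for rx in excludes)
        ]

        for name in filenames:
            rel = (rel_dir / name).as_posix()
            for rx, handler in handlers:
                if rx.match(rel):
                    yield Path(dirpath) / name, handler
                    break  # first matching pattern wins
```

With this shape, the files under an excluded directory are never listed or matched against any pattern, which is the saving a glob-based scan cannot offer since it enumerates every path first.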

purarue commented 1 month ago

Yep, makes a lot of sense

Thanks for the improvements by the way.

Not saying I'm always content with the stuff being slow, I just often find it hard to motivate myself to put extra work into google_takeout_parser because I'm just waiting for Google to break something and then having to fix it.

Appreciate it 💙