purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License
82 stars 14 forks source link

failure during `_warn_if_no_activity()` on windows #77

Open karlicoss opened 1 month ago

karlicoss commented 1 month ago

So windows pipelines for promnesia started failing during takeout tests (https://github.com/karlicoss/promnesia/actions/runs/11003856666/job/30553708107) with 'missing ), unterminated subpattern at position 16'

I debugged, and it's happending during warn_if_no_activity call https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/google_takeout_parser/path_dispatch.py#L181

This is because here some ungodly things are happening to the regexes :D https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/google_takeout_parser/locales/main.py#L49-L53

So after my recent change to location regexes https://github.com/seanbreckenridge/google_takeout_parser/commit/60e230e55e4b3a51836f9207c20e167cbe128af2, now one of prefixes ends up as 'Location History( ', which is an invalid regex

This in turn is because under windows, Path would use \ as the separator, so we end up with this

(Pdb) Path(r"Location History( \(Timeline\))?/Records.json").parts  
('Location History( ', '(Timeline', '))?', 'Records.json')

(also tests running this code are suppressed on windows hehe) https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/tests/test_locale_paths.py#L8

So I see a few options to resolve this:

Let me know what you think, happy to fix! Perhaps that would even unbork that windows test!

purarue commented 1 month ago

in principle there is no reason to do this hacky regex in _warn_if_no_activity at all unless I'm missing something... Handlers are properly built against full paths, so that would be way more reliable.

Yeah, Im thinking you can probably remove the hacky regex check here. The purpose was mostly to warn a user when they're first trying to parse an export from google, that they might be passing the wrong directory.

Perhaps it would be useful to move it just to the CLI code in __main__.py and remove it from lib code altogether?

karlicoss commented 1 month ago

Yeah, I don't have strong opinion about exposing it in the library. We could probably achieve similar results by using @warn_if_empty decorator in HPI? The only tricky thing (regardless whether we're passing expected= to match_structure, or using warn_if_empty) is that sometimes takeout dirs are split in multiple zip archives (with -001/-002/-003 etc suffixes) -- zip has some limitations on archive size. They are valid archives, and google_takeout_parser will process them fine, but we will get warnings if no data was extracted from one of such archives (just because it happened to contain no useful data, or has no data google_takeout_parser supports at the moment).

For now just submitted a quick fix to unblock Promnesia CI pipeline!

purarue commented 1 month ago

Yeah, it's mostly just supposed to be a helpful warning, and at the time i assumed there was no downside, but it may be over-reporting the error in the split archive case.

Will think about removing it or just moving it to CLI code, warning if the parsed results are empty.

Feel free to leave this issue open to track that.

purarue commented 3 weeks ago

moved to just when the parse command is called, to warn the user the first time they might be parsing https://github.com/seanbreckenridge/google_takeout_parser/commit/b53a8b2c7c5809832a0ed488afc0fa17b7d3fbd2

I think this can be closed? let me know if it has to stay open