Open karlicoss opened 1 month ago
in principle there is no reason to do this hacky regex in _warn_if_no_activity at all unless I'm missing something... Handlers are properly built against full paths, so that would be way more reliable.
Yeah, I'm thinking you can probably remove the hacky regex check here. The purpose was mostly to warn a user when they're first trying to parse an export from Google that they might be passing the wrong directory.
Perhaps it would be useful to move it just to the CLI code in __main__.py and remove it from lib code altogether?
Yeah, I don't have a strong opinion about exposing it in the library. We could probably achieve similar results by using the @warn_if_empty decorator in HPI?
The only tricky thing (regardless of whether we're passing expected= to match_structure, or using warn_if_empty) is that sometimes takeout dirs are split into multiple zip archives (with -001/-002/-003 etc. suffixes) -- zip has some limitations on archive size. They are valid archives, and google_takeout_parser will process them fine, but we will get warnings if no data was extracted from one of such archives (just because it happened to contain no useful data, or has no data google_takeout_parser supports at the moment).
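To make the split-archive concern concrete, here is a minimal sketch of a warn-if-empty style wrapper. The names warn_if_empty_sketch and parse_archive are hypothetical stand-ins for illustration, not HPI's actual decorator or google_takeout_parser's API; the sketch only shows why a perfectly valid but data-free archive part would trigger a warning.

```python
import functools
import warnings
from typing import Callable, Iterator, List


def warn_if_empty_sketch(func: Callable[..., Iterator[str]]) -> Callable[..., Iterator[str]]:
    """Hypothetical stand-in: warn when the wrapped generator yields nothing."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs) -> Iterator[str]:
        empty = True
        for item in func(*args, **kwargs):
            empty = False
            yield item
        if empty:
            # Fires once the generator is exhausted without producing anything.
            warnings.warn(f"{func.__name__} yielded no results")

    return wrapper


@warn_if_empty_sketch
def parse_archive(events: List[str]) -> Iterator[str]:
    """Hypothetical parser: 'events' stands in for the contents of one archive part."""
    yield from events
```

With this shape, parsing a -001 part that has data stays silent, while a -002 part that happens to contain nothing supported emits a warning even though the archive itself is fine.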
For now just submitted a quick fix to unblock Promnesia CI pipeline!
Yeah, it's mostly just supposed to be a helpful warning, and at the time I assumed there was no downside, but it may be over-reporting the error in the split archive case.
Will think about removing it or just moving it to CLI code, warning if the parsed results are empty.
Feel free to leave this issue open to track that.
Moved it to run only when the parse command is called, to warn the user the first time they might be parsing:
https://github.com/seanbreckenridge/google_takeout_parser/commit/b53a8b2c7c5809832a0ed488afc0fa17b7d3fbd2
I think this can be closed? let me know if it has to stay open
So Windows pipelines for promnesia started failing during takeout tests (https://github.com/karlicoss/promnesia/actions/runs/11003856666/job/30553708107) with
'missing ), unterminated subpattern at position 16'
I debugged, and it's happening during the warn_if_no_activity call: https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/google_takeout_parser/path_dispatch.py#L181
This is because here some ungodly things are happening to the regexes :D https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/google_takeout_parser/locales/main.py#L49-L53
So after my recent change to location regexes, one of the prefixes now ends up as 'Location History( ', which is an invalid regex. The change: https://github.com/seanbreckenridge/google_takeout_parser/commit/60e230e55e4b3a51836f9207c20e167cbe128af2
This in turn is because under Windows, Path would use \ as the separator, so we end up with this (also, tests running this code are suppressed on Windows, hehe): https://github.com/seanbreckenridge/google_takeout_parser/blob/a1b5dce203fc99d32816e80c5e991b4ad2888c8a/tests/test_locale_paths.py#L8
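The failure can be reproduced on any platform with pathlib's pure path classes. The pattern below is an illustrative assumption, not copied from locales/main.py; it only has the same shape (an escaped parenthesis inside an optional group) that leads to the 'Location History( ' prefix.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Illustrative pattern only (an assumption); the real regexes live in locales/main.py.
pattern = r"Location History( \(Timeline\))?/Settings.json"

# POSIX paths split only on '/', so the escaped parens stay inside one component.
posix_prefix = PurePosixPath(pattern).parts[0]

# Windows paths also treat '\' as a separator, so '\(' cuts the component short.
windows_prefix = PureWindowsPath(pattern).parts[0]


def is_valid_regex(text: str) -> bool:
    """Return True if 'text' compiles as a regular expression."""
    try:
        re.compile(text)
    except re.error:
        return False
    return True
```

Here posix_prefix keeps the whole optional group, while windows_prefix is cut at the first backslash and leaves an unclosed parenthesis at position 16, matching the error from the CI run.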
So I see a few options to resolve this:
- use path.split('/')[0] instead of Path.parts(path)[0], this is the simplest and I think will just work?
- make re.match in re.match(activity_dir, str(p)): defensive -- this probably makes sense considering how hacky the whole thing is, and this will still be consistent with the "any regex matches" logic. Although with all the takeout directory renaming it's possible there won't be a single valid regex there in the future
- in principle there is no reason to do this hacky regex in _warn_if_no_activity at all unless I'm missing something -- could just check values of self.handlers against _parse_json_activity, _parse_location_history, _parse_chrome_history? Handlers are properly built against full paths, so that would be way more reliable. It's still used in HPI though for match_structure, so I guess we want to fix get_paths_for_functions properly https://github.com/karlicoss/HPI/blob/a8f86e32b981aef62890605e12da9cd59c9cc0c8/my/google/takeout/parser.py#L69-L81 Although it's only used during match_structure processing, I usually just use zip files directly (which don't need the EXPECTED thing), so might move this call under a lru_cache wrapped function, so it's only called when it's actually necessary

Let me know what you think, happy to fix! Perhaps that would even unbork that windows test!
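A rough sketch of the first two options, for discussion. The helper names first_component and any_prefix_matches are hypothetical and do not exist in google_takeout_parser; the real code would need to be adapted around its own data structures.

```python
import re
from typing import Iterable


def first_component(pattern: str) -> str:
    """Option 1: split on '/' directly, so the result is the same on every OS."""
    return pattern.split("/")[0]


def any_prefix_matches(prefixes: Iterable[str], candidate: str) -> bool:
    """Option 2: keep the 'any regex matches' logic, but skip invalid regexes."""
    for prefix in prefixes:
        try:
            if re.match(prefix, candidate):
                return True
        except re.error:
            # A malformed prefix is ignored instead of crashing the whole check.
            continue
    return False
```

Option 1 avoids producing the broken prefix in the first place, while option 2 only stops it from raising; the two are not mutually exclusive.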