purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License
82 stars 14 forks

DE locale is guessed instead of EN due to a missing `Semantic Location History` regex #69

Open karlicoss opened 1 month ago

karlicoss commented 1 month ago

So I noticed that with my latest takeouts, the DE locale is sometimes guessed, which results in a lot of stuff not being parsed (like "My activity").

After a bit of debugging, it seems that the DE locale gets a higher score because it matches r"Location History \(Timeline\)/Semantic Location History/.*/.*.json", whereas in the EN locale the regex is r"Location History/Semantic Location History/.*/.*.json".

Because semantic location history contains so many files, it basically trumps any other matches and the locale gets switched to DE :sweat_smile:
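To illustrate the effect, here is a minimal sketch that counts how many files each of the two quoted patterns matches. The file names are hypothetical, and `count_matches` is a simplified stand-in for scoring, not the library's actual code; it assumes patterns are matched against paths relative to the takeout root.

```python
import re

# The two patterns quoted above, copied as-is.
EN_PATTERN = r"Location History/Semantic Location History/.*/.*.json"
DE_PATTERN = r"Location History \(Timeline\)/Semantic Location History/.*/.*.json"

# Hypothetical file listing from a recent takeout that uses the
# "Location History (Timeline)" directory name.
paths = [
    f"Location History (Timeline)/Semantic Location History/2023/2023_{month}.json"
    for month in ("JANUARY", "FEBRUARY", "MARCH")
]
paths.append("My Activity/Search/MyActivity.html")


def count_matches(pattern: str, candidates: list) -> int:
    """Count the paths that match the pattern from the start of the string."""
    return sum(1 for path in candidates if re.match(pattern, path))


en_score = count_matches(EN_PATTERN, paths)
de_score = count_matches(DE_PATTERN, paths)
print(f"EN: {en_score}, DE: {de_score}")
```

With one file per month over several years, the pattern that matches the semantic history directory easily outweighs everything else in the takeout.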

So there are a few possible things to fix/improve:

  1. regexes for location history in en locale

    In my case, locations are present in:

    • Takeout/Location History/Location History.json (circa Jan 2020)
    • Takeout/Location History/Records.json (circa Apr 2022)
    • Takeout/Location History (Timeline)/Records.json (circa Feb 2024)

    Semantic Location History/ and Settings.json are also in the same dirs (since around 2021)

    Feels like the easiest would be to just use Location History( \(Timeline\))? in all location regexes? Happy to do the change.

  2. another part of the issue is that there is some duplication here:

    https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/locales/en.py#L44-L53
    https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/locales/de.py#L14-L21

    So if we update one of them, it's easy to miss the other one.

    Maybe worth extracting common bits (to common.py?) and reusing them both in en and de? Maybe they are just named the same no matter locale (who knows?), but if not can think about it later. Also happy to PR!

  3. since there are so many different files under semantic history, they dominate the score

    I guess the same would apply to any similar globs in the handler map. Not sure what a good way to fix it is.

    Do we actually need to pick the 'winning' handler map? Perhaps could just go through all of them in order and try all?

    If it doesn't match a locale, then there is no harm. It would also fit nicely with extracting common locale bits, so you could have 'shared' locale (or just en.py :) ), and the rest are just extending.
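The regex suggested in point 1 can be checked against the three observed layouts. This is only a sketch: the constant names are made up for illustration, the `Takeout/` prefix is dropped on the assumption that patterns are relative to the takeout root, and the dot before `json` is escaped here, which the quoted patterns do not do.

```python
import re

# Optional " (Timeline)" suffix, as proposed in point 1.
LOCATION_DIR = r"Location History( \(Timeline\))?"

# Hypothetical combined patterns built on top of it.
RECORDS = re.compile(LOCATION_DIR + r"/(Location History|Records)\.json")
SEMANTIC = re.compile(LOCATION_DIR + r"/Semantic Location History/.*/.*\.json")

# The three layouts listed in point 1, without the "Takeout/" prefix.
observed = [
    "Location History/Location History.json",            # circa Jan 2020
    "Location History/Records.json",                     # circa Apr 2022
    "Location History (Timeline)/Records.json",          # circa Feb 2024
]

records_ok = all(RECORDS.fullmatch(path) for path in observed)

# Hypothetical semantic history files under both directory names.
semantic_ok = all(
    SEMANTIC.fullmatch(f"{directory}/Semantic Location History/2023/2023_MAY.json")
    for directory in ("Location History", "Location History (Timeline)")
)
print(records_ok, semantic_ok)
```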

purarue commented 1 month ago

Ah, I've stopped tracking my location history with google, so probably didn't notice this change.

I think you should be able to temporarily override, but agreed it's something that could be improved.

Feels like the easiest would be to just use Location History( \(Timeline\))? in all location regexes? Happy to do the change.

Yeah, you can update them so that the EN regexes match; I wasn't aware they had changed.

Maybe worth extracting common bits (to common.py?) and reusing them both in en and de?

Yeah... we can do this. It was something I considered, but I thought it would be nice if, when you were looking at EN, it was all in the one file. It's not a huge deal to have to jump between two (en.py and a common.py, or a de.py and a common.py).

Maybe they are just named the same no matter locale (who knows?) so if we update one of them it's easy to miss the other one.

In general, this is sort of a hard problem unless we have contributors/users who are using each locale. I don't know if we can assume that if the folder name changes in one locale, it has changed in all of them (maybe it's fine to, but I don't trust google to be consistent), and switching locales and having to wait till the takeout finishes processing is not a quick thing to do.

Do we actually need to pick the 'winning' handler map? Perhaps could just go through all of them in order and try all?

This could probably work as well...? Would probably just need to change the logs a bit, but it looks like the dispatch_map already allows you to pass multiple handlers. I'm not totally opposed to just passing it every locale. Is not elegant, but it might be the least confusing and work the best (as long as I'm not totally missing some reason not to).
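A rough sketch of the "try every locale in order" idea follows. It does not use the library's real `dispatch_map` or handler functions; `find_handler`, the two maps, the parser stubs, and the DE activity folder name are all hypothetical and only show the lookup order.

```python
import re
from typing import Callable, Dict, List, Optional

Handler = Callable[[str], str]
HandlerMap = Dict[str, Handler]


def parse_activity(path: str) -> str:
    return f"activity:{path}"


def parse_semantic(path: str) -> str:
    return f"semantic:{path}"


# Hypothetical, heavily trimmed handler maps for two locales.
EN_MAP: HandlerMap = {
    r"My Activity/.*\.html": parse_activity,
    r"Location History/Semantic Location History/.*/.*\.json": parse_semantic,
}
DE_MAP: HandlerMap = {
    r"Meine Aktivitäten/.*\.html": parse_activity,
    r"Location History \(Timeline\)/Semantic Location History/.*/.*\.json": parse_semantic,
}


def find_handler(path: str, handler_maps: List[HandlerMap]) -> Optional[Handler]:
    """Return the first handler whose pattern matches, trying each map in order."""
    for handler_map in handler_maps:
        for pattern, handler in handler_map.items():
            if re.fullmatch(pattern, path):
                return handler
    return None  # no locale recognises this file


# No single locale "wins": each file is resolved independently.
maps = [DE_MAP, EN_MAP]
activity_handler = find_handler("My Activity/Search/MyActivity.html", maps)
semantic_handler = find_handler(
    "Location History (Timeline)/Semantic Location History/2023/2023_MAY.json", maps
)
```

Because each file is looked up on its own, a directory with many files can no longer push the whole takeout into the wrong locale; a pattern from a locale that does not apply simply never matches.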

Am leaning more towards option 3 than 2 because it means we don't have to create a shared file and make sure that remains updated between different locales?

I'm willing to go with either fix though, if you want to try whichever feels more correct to you and PR that would be great.

karlicoss commented 1 month ago

I think you should be able to temporarily override

Yep, I found the env variable! But considering the default is guessing and most users would expect EN locale, it should really work out of the box

switching locales and having to wait till the takeout finishes processing is not a quick thing to do.

I wouldn't even expect it to work like that straightaway lol, it might take forever to propagate. Some parts of google's interface are still in Russian for me, and I get random service emails in Russian from google even though I have Russian nowhere in the interface.

Is not elegant, but it might be the least confusing and work the best

Yep, agree! I will trial it for a bit and see how it goes (maybe will swap so DE goes first, as this would test a more interesting codepath). In the meantime, I will PR a regex fix as a short-term measure.