purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License

Takeout folders are localized according to account's main language #43

Closed parthux1 closed 9 months ago

parthux1 commented 1 year ago

problem

If an account doesn't have English as its main language, folders and some files are named differently (localized). As a result, no folders are parsed, because _match_handler finds no match for the localized names.

possible solution

I think it would be beneficial to add default handler maps, like the one defined here (DEFAULT_HANDLER_MAP), for other languages.
A user could then select a HandlerMap via a command line argument.
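
A minimal sketch of the proposal above, not the library's actual code: one handler map per locale, keyed by regular expressions over paths relative to the Takeout root. The names HANDLER_MAPS, match_handler, parse_activity and parse_locations are hypothetical and only serve the illustration; the folder names are taken from this thread.

```python
import re
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]
HandlerMap = Dict[str, Handler]


# Hypothetical parser stubs; the real library has its own parsing functions.
def parse_activity(path: str) -> str:
    return f"activity:{path}"


def parse_locations(path: str) -> str:
    return f"locations:{path}"


# One handler map per locale (assumed layout, for illustration only).
HANDLER_MAPS: Dict[str, HandlerMap] = {
    "EN": {
        r"My Activity/.*/MyActivity\.(json|html)": parse_activity,
        r"Location History/Records\.json": parse_locations,
    },
    "DE": {
        r"Meine Aktivitäten/.*/MeineAktivitäten\.(json|html)": parse_activity,
        r"Standortverlauf/Records\.json": parse_locations,
    },
}


def match_handler(relative_path: str, locale: str = "EN") -> Optional[Handler]:
    """Return the first handler whose pattern matches, or None on a miss."""
    for pattern, handler in HANDLER_MAPS[locale].items():
        if re.match(pattern, relative_path):
            return handler
    return None
```

With this layout, a German path such as "Meine Aktivitäten/Anzeigen/MeineAktivitäten.html" resolves to a handler under "DE" but misses under "EN", which is the failure described in the problem section.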

If you approve this idea, I could work on a pull request adding a German localization as well as a command line argument for selecting a handler map.

If there's already an option to achieve this please fill me in :)

purarue commented 1 year ago

Oh, interesting

Yeah, there's no way to do that currently. Could you post an example of the folder structure here?

It would probably just be a constructor arg to the TakeoutParser class, with an option in the shared options in __main__.py to let the user specify it.

I think a flag makes sense in case the user wants to specify it, but I can add automatic detection afterwards; I have a good idea of how to do that.

PR would be appreciated, thanks; let me know if you have any questions/issues

purarue commented 1 year ago

@karlicoss as an FYI, this could break the EXPECTED in google.takeout.parser https://github.com/karlicoss/HPI/blob/8288032b1c185bda2ddae6b3a956e87d43314604/my/google/takeout/parser.py#L70-L76, as it searches for the English names

As a quick fix could see letting the user override the EXPECTED for match_structure with their config, will create a PR to HPI once this has settled

parthux1 commented 1 year ago

Note: I'm currently running a bigger export. Will update once it's finished.

Comparing to your testdata:

Directories/files not mentioned below have the same names in the German localization as in English.

/Takeout

| English | German |
| --- | --- |
| Chrome | |
| Google Play Store | |
| Location History | Standortverlauf |
| My Activity | Meine Aktivitäten |
| Youtube and Youtube Music | |
| archive_browser.html | Archiv_Übersicht.html |

/ My Activity

| English | German |
| --- | --- |
| Ads | Anzeigen |
| Google Analytics | ? |
| Google Apps | Google Spiele |
| Google Cloud | ? |
| Google Translate | Google Übersetzer |
| Help | Hilfe |
| Image Search | Bildersuche |
| Podcasts | ? |
| Video Search | Videosuche |
| in each subfolder: MyActivity.json | MeineAktivitäten.json |

(probably not up-to-date in testdata)

| English | German |
| --- | --- |
| Assistant | Google Assistant |
| Developers | Google Developers |
| News | Google News |
| Search | Google Suche |

purarue commented 1 year ago

Great, thanks, looks different enough that detection shouldn't be a problem :+1:

Feel free to restructure path_handler as you see fit; perhaps you could add a locales/de.py folder/file, following ISO 3166

And then at the top of path_handler.py you can add: from .locales.de import HANDLER_MAP as GERMAN_HANDLER_MAP

May want to update the setup.py file as well to ensure that subpackage is included:

diff --git a/setup.py b/setup.py
index d459510..e8bccfd 100644
--- a/setup.py
+++ b/setup.py
@@ -18,7 +18,7 @@ setup(
     long_description_content_type="text/markdown",
     license="MIT",
     packages=find_packages(
-        include=["google_takeout_parser", "google_takeout_parser.parse_html"]
+        include=["google_takeout_parser", "google_takeout_parser.parse_html", "google_takeout_parser.locales"]
     ),
     install_requires=reqs,
     package_data={pkg: ["py.typed"]},

I'll do an export myself just to confirm the current names for me

purarue commented 1 year ago

Just did a new export, for reference if you wanted to compare:

.
├── archive_browser.html
├── Chrome
│   ├── Autofill.json
│   ├── Bookmarks.html
│   ├── BrowserHistory.json
│   ├── Device Information.json
│   ├── Dictionary.csv
│   ├── Extensions.json
│   ├── Omnibox.json
│   ├── OS Settings.json
│   ├── SearchEngines.json
│   └── SyncSettings.json
├── Google Play Store
│   ├── Devices.json
│   ├── Installs.json
│   ├── Library.json
│   ├── Play Settings.json
│   ├── Purchase History.json
│   └── Reviews.json
├── Location History
│   ├── Records.json
│   ├── Semantic Location History
│   │   └── 2023
│   │       ├── 2023_FEBRUARY.json
│   │       ├── 2023_JANUARY.json
│   │       └── 2023_MARCH.json
│   └── Settings.json
├── My Activity
│   ├── Ads
│   │   └── MyActivity.json
│   ├── Android
│   │   └── MyActivity.json
│   ├── Assistant
│   │   └── MyActivity.json
│   ├── Books
│   │   └── MyActivity.json
│   ├── Developers
│   │   └── MyActivity.json
│   ├── Discover
│   │   └── MyActivity.json
│   ├── Drive
│   │   └── MyActivity.json
│   ├── Gmail
│   │   └── MyActivity.json
│   ├── Google Analytics
│   │   └── MyActivity.json
│   ├── Google Arts _ Culture
│   │   └── MyActivity.json
│   ├── Google Cloud
│   │   └── MyActivity.json
│   ├── Google Lens
│   │   └── MyActivity.json
│   ├── Google Play Movies _ TV
│   │   └── MyActivity.json
│   ├── Google Play Store
│   │   └── MyActivity.json
│   ├── Google Store
│   │   └── MyActivity.json
│   ├── Google Translate
│   │   └── MyActivity.json
│   ├── Help
│   │   └── MyActivity.json
│   ├── Image Search
│   │   └── MyActivity.json
│   ├── Maps
│   │   └── MyActivity.json
│   ├── News
│   │   └── MyActivity.json
│   ├── Podcasts
│   │   └── MyActivity.json
│   ├── Search
│   │   └── MyActivity.json
│   ├── Shopping
│   │   └── MyActivity.json
│   ├── Takeout
│   │   └── MyActivity.json
│   ├── Video Search
│   │   └── MyActivity.json
│   └── YouTube
│       └── MyActivity.json
└── YouTube and YouTube Music
    ├── history
    │   ├── search-history.json
    │   └── watch-history.json
    ├── my-comments
    │   └── my-comments.html
    ├── my-live-chat-messages
    │   └── my-live-chat-messages.html
    ├── playlists
    │   ├── Favorites.csv
    │   └── Liked videos.csv
    └── subscriptions
        └── subscriptions.csv

39 directories, 55 files

parthux1 commented 1 year ago

For reference, here is the corresponding tree for the folders in your export. The actual mapping, following your current mapping dict, will be part of the PR.

|   Archiv_Übersicht.html
|
+---Chrome
|       Autofill.json
|       Bookmarks.html
|       BrowserHistory.json
|       Device Information.json
|       Dictionary.csv
|       Extensions.json
|       Omnibox.json
|       OS Settings.json
|       SearchEngines.json
|       SyncSettings.json
|
+---Google Play Store
|       Devices.json
|       Installs.json
|       Library.json
|       Order History.json
|       Play Settings.json
|       Promotion History.json
|       Purchase History.json
|       Reviews.json
|
+---Standortverlauf
|   |   Records.json
|   |   Settings.json
|   |
|   \---Semantic Location History
|       +---2022
|       |       2022_DECEMBER.json
|       |       2022_NOVEMBER.json
|       |
|       \---2023
|               2023_FEBRUARY.json
|               2023_JANUARY.json
|               2023_MARCH.json
|
+---Meine Aktivitäten
|   +---Android
|   |       MeineAktivitäten.html
|   |
|   +---Anzeigen
|   |       MeineAktivitäten.html
|   |
|   +---Assistant Memory
|   |       MeineAktivitäten.html
|   |
|   +---Bildersuche
|   |       MeineAktivitäten.html
|   |
|   +---Books
|   |       MeineAktivitäten.html
|   |
|   +---Chrome
|   |       MeineAktivitäten.html
|   |
|   +---Datenexport
|   |       MeineAktivitäten.html
|   |
|   +---Discover
|   |       MeineAktivitäten.html
|   |
|   +---Drive
|   |       MeineAktivitäten.html
|   |
|   +---Gmail
|   |       MeineAktivitäten.html
|   |
|   +---Google Assistant
|   |       MeineAktivitäten.html
|   |
|   +---Google Developers
|   |       MeineAktivitäten.html
|   |
|   +---Google Lens
|   |       MeineAktivitäten.html
|   |
|   +---Google News
|   |       MeineAktivitäten.html
|   |
|   +---Google Play Filme _ Serien
|   |       MeineAktivitäten.html
|   |
|   +---Google Play Spiele
|   |       MeineAktivitäten.html
|   |
|   +---Google Play Store
|   |       MeineAktivitäten.html
|   |
|   +---Google Store
|   |       MeineAktivitäten.html
|   |
|   +---Google Suche
|   |       MeineAktivitäten.html
|   |
|   +---Google Übersetzer
|   |       MeineAktivitäten.html
|   |
|   +---Hilfe
|   |       MeineAktivitäten.html
|   |
|   +---Maps
|   |       MeineAktivitäten.html
|   |
|   +---Shopping
|   |       MeineAktivitäten.html
|   |
|   +---Videosuche
|   |       MeineAktivitäten.html
|   |
|   \---YouTube
|           MeineAktivitäten.html
|
\---YouTube und YouTube Music
    +---Abos
    |       Abos.csv
    |
    +---Meine Kommentare
    |       Meine Kommentare.html
    |
    +---meine-live-chat-nachrichten
    |       meine-live-chat-nachrichten.html
    |
    +---musik-mediathek-songs
    |       musik-mediathek-songs.csv
    |
    +---Playlists
    |       [names of playlist].csv
    |       Uploads from [channel name].csv
    |
    +---Verlauf
    |       Suchverlauf.html
    |       Wiedergabeverlauf.html
    |
    \---Videos
            [name of uploaded video].mp4
            Video-Metadaten.csv

purarue commented 1 year ago

If you could clone and test the DE locale, that would be great. I did an export myself with a secondary Google account, but I don't have as much data:

git clone https://github.com/seanbreckenridge/google_takeout_parser
cd google_takeout_parser
python3 -m pip install .
python3 -m google_takeout_parser --verbose parse Takeout -a summary

It should guess that it's de based on the files present, but if it doesn't, you can specify --locale DE
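
One way such a guess could work, as a rough sketch rather than the library's actual implementation: count how many of each locale's characteristic top-level folder names exist in the export and pick the best match. LOCALE_MARKERS and guess_locale are hypothetical names. The GOOGLE_TAKEOUT_PARSER_LOCALE environment variable is the one named in the debug log later in this thread, but the order of precedence used here (explicit argument, then environment variable, then detection) is an assumption.

```python
import os
from pathlib import Path
from typing import Dict, Optional, Set

# Top-level folder names that differ between locales (from the trees in this thread).
LOCALE_MARKERS: Dict[str, Set[str]] = {
    "EN": {"My Activity", "Location History", "YouTube and YouTube Music"},
    "DE": {"Meine Aktivitäten", "Standortverlauf", "YouTube und YouTube Music"},
}


def guess_locale(takeout_dir: Path, override: Optional[str] = None) -> str:
    """Pick a locale: explicit override, then env var, then folder-name detection."""
    chosen = override or os.environ.get("GOOGLE_TAKEOUT_PARSER_LOCALE")
    if chosen:
        return chosen.upper()
    # Names of the directories directly under the Takeout root.
    present = {p.name for p in takeout_dir.iterdir() if p.is_dir()}
    # Score each locale by how many of its marker folders are present.
    scores = {loc: len(markers & present) for loc, markers in LOCALE_MARKERS.items()}
    best = max(scores, key=lambda loc: scores[loc])
    # Fall back to English when nothing locale-specific was found.
    return best if scores[best] > 0 else "EN"
```

Folders that keep the same name in both languages, such as Chrome or Google Play Store, are deliberately left out of the markers because they carry no signal.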

parthux1 commented 10 months ago

Thanks for your progress on this issue. I'm sorry for responding after several months, but it looks like GitHub didn't inform me about updates on this thread. I'm subscribed to this issue, but next time you may want to add an @parthux1 to your message.

Judging by the log below, it looks like some folder names have changed in the meantime.

I tried editing locales/de.py locally to reflect these changes, but after using pip's --force-reinstall and installing the module with these changes into a new venv, my dict changes weren't used (the same output was generated).

My changes to de.py

-    r"Standortverlauf/Semantic Location History/.*/.*.json": _parse_semantic_location_history,
+    r"Location History (Timeline)/Semantic Location History/.*/.*.json": _parse_semantic_location_history,
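
A side note on the changed key, separate from the reinstall problem: assuming the handler-map keys are matched as regular expressions (which the r"..." strings and .* wildcards suggest), the unescaped parentheses in "(Timeline)" form a capture group instead of matching literal parentheses, so the new key would not match the new folder name even once it is picked up. A small check using only Python's re module:

```python
import re

path = "Location History (Timeline)/Semantic Location History/2023/2023_MAY.json"

# Unescaped: "(Timeline)" is a group matching the text "Timeline", so the
# pattern expects "Location History Timeline/..." and misses the real path.
unescaped = r"Location History (Timeline)/Semantic Location History/.*/.*.json"

# Escaped: the parentheses and the dot before "json" are literal characters.
escaped = r"Location History \(Timeline\)/Semantic Location History/.*/.*\.json"

print(re.match(unescaped, path))      # None
print(bool(re.match(escaped, path)))  # True
```

re.escape("Location History (Timeline)") produces an equivalent escaped prefix if writing the backslashes by hand is error-prone.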

Log

❯ google_takeout_parser --verbose parse Takeout -a summary --locale DE
[D 231228 16:02:54 path_dispatch:200] User specified locale: DE
[D 231228 16:02:54 path_dispatch:203] Using locale DE. To override set, GOOGLE_TAKEOUT_PARSER_LOCALE
[D 231228 16:02:54 path_dispatch:248] Trying to match one of: ['Chrome', 'Location History', 'Meine Aktivitäten', 'My Activity', 'Standortverlauf', 'YouTube( and YouTube Music)?', 'YouTube( und YouTube Music)?']
[D 231228 16:02:54 path_dispatch:256] Matched expected directory: Location History
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Records.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2022/2022_DECEMBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2022/2022_NOVEMBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_APRIL.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_AUGUST.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_DECEMBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_FEBRUARY.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_JANUARY.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_JULY.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_JUNE.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_MARCH.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_MAY.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_NOVEMBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_OCTOBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Semantic Location History/2023/2023_SEPTEMBER.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Settings.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Location History (Timeline)/Timeline Edits.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Maps (Meine Orte)/Bewertungen.json
[W 231228 16:02:54 path_dispatch:331] No function to handle parsing Maps (Meine Orte)/Gespeicherte Orte.json
Counter()
purarue commented 10 months ago

Ah thanks :+1: @parthux1

If there are any changes from your end once you have the new export, feel free to make a PR and update the de.py file