purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License
82 stars 14 forks source link

Handler map localization #44

Closed parthux1 closed 1 year ago

parthux1 commented 1 year ago

PR for Issue #43.

Structure

  1. added folder locales containing 1.1 common.py - defines several TakeoutFiles(Enum), which are resolved to a parser later 1.2 en.py, de.py - HandlerMap from path_dispatch moved here. Each module defines such a map and links filepaths to TakeoutFiles from common.py

  2. added path_handler.py which 2.1 resolves enum TakeoutFile to a callable (prev. happened directly in the HandlerMap) 2.2 groups localized HandlerMaps (en.py, de.py) into the class Localizedhandlers

due to changes on class TakeoutParser you can e.g do

from google_takeout_parser.path_dispatch import TakeoutParser, LocalizedHandler
tp = TakeoutParser('path/to/takeout', handlers=[LocalizedHandler.EN(), LocalizedHandler.DE()])

this would try to map each encountered file to the EN map, and if not resolved to the DE map afterwards.

Details following:

Changes to existing code

I think these features compliment each other pretty well. One could define some kind of "Standard-Ignore" HandlerMap which is supplied as the last element to TakeoutParser. Custom functions could be given different priorities by list-order

current problems:

  1. parsing my-activity files yield a warning eg. "time data '12.07.2019, 14:51:43' does not match format '%b %d, %Y, %I:%M:%S %p"
  2. __main__.py was updated but doesn't handle the parser flag because I have no idea how to debug it. Currently I debug by importing classes as shown in the prev. example
  3. TakeoutParser._warn_if_no_activity path needs to be resolved via the localized HandlerMap. One could just hardcode a key to access the map, but I don't really like this solution

Is this concept understandable and beneficial? Are there any files I should keep in mind while working? How should I start adding tests for this?

purarue commented 1 year ago

Locale splitting looks good :+1:

Are there any files I should keep in mind while working?

For __main__.py, incase it wasnt obvious since the decorator is somewhat complicated, to add the CLI flag, it'd be:

diff --git a/google_takeout_parser/__main__.py b/google_takeout_parser/__main__.py
index d5fe165..31cd5e8 100644
--- a/google_takeout_parser/__main__.py
+++ b/google_takeout_parser/__main__.py
@@ -43,7 +43,7 @@ SHARED = [
     # TODO: actually handle argument
     click.option(
         "-l",
-        "--localization",
+        "--locale",
         type=click.Choice(["en", "ger"]),
         default="en",
         help="Used DEFAULT_HANDLER_MAP resoling folder names to parser models",
@@ -110,7 +110,7 @@ def parse(cache: bool, action: str, takeout_dir: str) -> None:
 @main.command(short_help="merge multiple takeout directories")
 @shared_options
 @click.argument("TAKEOUT_DIR", type=click.Path(exists=True), nargs=-1, required=True)
-def merge(cache: bool, action: str, takeout_dir: Sequence[str]) -> None:
+def merge(cache: bool, action: str, locale: str, takeout_dir: Sequence[str]) -> None:
     """
     Parse and merge multiple takeout directories
     """

Also I think should rename to --locale, shorter to type

How should I start adding tests for this?

For the path_handler you've done what I would've synced a copy of the directory -- good.

Otherwise, I think anywhere where you see differences in the actual structure of code where it leads to differences, you could extract an HTML div or JSON blob like I do in:

I dont think there should be many of those though, but if you're seeing errors there might be a differently structured file for different locales

parsing my-activity files yield a warning eg. "time data '12.07.2019, 14:51:43' does not match format '%b %d, %Y, %I:%M:%S %p"

I think that error in particular is the localizations of the dates when using the html exports (e.g. for the My Activity.html)

https://github.com/seanbreckenridge/google_takeout_parser/blob/master/google_takeout_parser/parse_html/html_time_utils.py

Those are pretty annoying bits of code since google takeout has changed a bunch and we have to account for all the changes (at least in the english locale).

I think those should be from My Acitvity.html, so unless you want to pass the locale to parse_html_dt, we could opt to just not include the HTML parsing code for other locales. The code to parse the HTML should be seen as legacy anyways -- as it says in the readme you should pick JSON for those formats, as it then just returns an UTC epoch timestamp instead of having to handle dealing with locales. Could warn the user to do exactly that if parsing html files with other locales

You should also just make sure none of the other html parsing code is incorrect in a different locale. E.g. does comment.py have differences in german or is the date format the same?

parthux1 commented 1 year ago

Several change requests marked as resolved are implemented as proposed by you.

I left 2 discussions open:

  1. Resolve File to Parser in dispatch_map I changed the var name and switched the actually not working as intended continue to break. Should be okay now but you may want to take a quick look. See this commit
  2. Switch Enum-Mapping to referencing the parser functions in each locale HandlerMap Let's discuss this in the corresponding thread.

what I still need to do:

  1. Think about the date-localization problem. I'm currently getting another takeout with jsons to check this.
    Will probably just supply some readme section about this and reference it in the warning.
  2. Check for other differences like 1. and add tests to the corresponding tests/test_-files
  3. Before merging: update readmes "Library Usage" section

Regarding __main__.py

How do you test the command line behavior?😅
When trying to call it via command line

python google_takeout_parser\__main__.py parse [File]

I get errors due to the relative . imports. Do I have to locally package it in some way?

purarue commented 1 year ago

How do you test the command line behavior

While in the the root directory, run:

pip install .

to install it, and then you can use python3 -m google_takeout_parser. You could also just run python3 -m google_takeout_parser without the pip install ., but I tend to install it just to make sure Im not mistakenly in the wrong directory and using the globally installed google_takeout_parser and not the local version