purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License
82 stars 14 forks source link

Add additional fields to existing models #76

Closed mighabana closed 1 month ago

mighabana commented 1 month ago

Hi, I was reviewing the parser and comparing its output with my own Google Takeout data and I noticed some interesting fields that are not being handled by the parser. I went ahead and made some changes to include them, but if you think they are not important to track I'd be happy to discuss! 😄

Summary of Changes

  1. PlayStoreAppInstall model
    • Added deviceName, deviceCarrier, deviceManufacturer and firstInstallationDt fields
    • Changed dt field from the firstInstallationTime to the lastUpdateTime since this may be a more accurate representation of the installation event. I understand this may have some backward compatibility issues so it might be better to create a new installationTime field instead but then the name dt can be misleading. I'd love to hear your thoughts on this.
  2. Location model
    • Added deviceTag and source fields
    • This addition introduced a failing test in test_location_new() caused by how orjson handles large integers. I've already raised the issue in the orjson repository here.
    • Added a new test test_location_2024 that tests based on the current version of the location data.
  3. ChromeHistory model
    • Added the pageTransition field
  4. Added "Chrome/History.json" to the en locale HandlerMapping. In my version of the Takeout files it seems to have been renamed from BrowserHistory.json to History.json.
  5. Added a new test for the Semantic Location data called _parse_semantic_location_history_2024 with the current version of the data mocked. I was going to try to include the additional activitySegment field, but it seems a bit complex. I'd be happy to discuss how to parse it in a separate issue.
purarue commented 1 month ago

Yeah, these are welcome improvements, always willing to parse additional useful data.

Haven't looked at the code but generally as a guide for any new additions, if looking up a key in a dictionary should be Optional[str] and use a .get() on the dictionary as we want to maintain backwards compatibility and not break parsing old exports.

Will take a look later today

purarue commented 1 month ago

Changed dt field from the firstInstallationTime to the lastUpdateTime since this may be a more accurate representation of the installation event

yeah, I think this is a small enough breaking change that its worth it. I will bump the takeout parser version which should force a full recompute for caches after merging this.

if you could just put some comments into the code itself describing how the two fields are different and that you've noticed that one is more accurate that the other, that would be great.

I understand this may have some backward compatibility issues so it might be better to create a new installationTime field instead but then the name dt can be misleading. I'd love to hear your thoughts on this.

yeah, if we're including both times, I think we can do that

you can use a dt property wrapper and then maybe in a bit I can mark it as deprecated? not sure what the best way to do that with a @property is

I think since we're adding the wrapper you can just use the keys that google does

so maybe something like

class ...:
    lastUpdateTime: datetime
    installTime: datetime

    @property
    def dt(self) -> datetime:
        return self.lastUpdateTime

and then just update the key function to use lastUpdatedTime instead of dt so there isn't a function call every time, its just there to prevent code that was using it from breaking

activitySegment

yeah, Im tracking that here if you want to leave any thoughts, I tried a few months ago and didn't feel my solution was great, it felt prone to breaking...

I think cachew now supports serializing dicts in its latest versions, so maybe it would be useful to just include the whole dict and pass it through unparsed so that the user can use it if they want? Dont feel you have to do that in this PR, you can open another one for that.

orjson

weird bug, wonder if this has something to do with JSON standard for integers/floating point...

If there isn't a response there for a while I'll think about a flag to optionally enable/disable orjson in lib code

feel free to ping me here again if this PR becomes a bit stale, I am pretty busy with other stuff right now so it might slip my mind

mighabana commented 1 month ago

Thanks for taking a look!

if you could just put some comments into the code itself describing how the two fields are different and that you've noticed that one is more accurate that the other, that would be great.

Sure I'll add some comments in my next commit to clarify these changes.

so maybe something like...

Yeah this makes a lot of sense to me. Thanks for the suggestion I would not have thought of this as a solution without it. I'll make this change as well in a while.

feel free to ping me here again if this PR becomes a bit stale, I am pretty busy with other stuff right now so it might slip my mind

No worries, I can work off of my fork for other downstream stuff in the meantime but if there's anything else I need to clean up on this PR just let me know as well. 😄

karlicoss commented 1 month ago

weird bug, wonder if this has something to do with JSON standard for integers/floating point...

Yeah in the example @mighabana provided, "deviceTag": -80241446968629135069, -- this doesn't fit in 64 bits, which is supported in orjson. I imagine this would be a won't fix as orjson is implemented in C, and doesn't looks like orjson has some sort of decoder customizations (e.g. to parse it as string -- in principle it doesn't matter since no one is going to do math on device tags anyway :stuck_out_tongue: )

Actually I checked my location history, and looks like all device tags are not as big in my case. It is possible that @seanbreckenridge just randomly typed something here as mock data? :) https://github.com/seanbreckenridge/google_takeout_parser/commit/cb07912c61c831d58e24f33b280051a730bdef30

In that case we just need to change it in test as it wouldn't occur in reality.

purarue commented 1 month ago

Heh, I don't think I just typed a random number for the device tag, but will compare with my data later.

mighabana commented 1 month ago

Hey @karlicoss thanks for the additional review! I'll resolve the issues you raised soon.

Actually I checked my location history, and looks like all device tags are not as big in my case.

Same for me actually my device tags were not as big, but I just didn't want to assume and thought that the large integers could be a small edge case. 🤔

Also is there's any reason only LocationHistory is parsed with orjson? I just checked and saw that the other parsers are using json.loads() instead of the _read_json_data() function. Maybe we can switch to use json to fix the issue since the bug doesn't seem to be occurring with it? I'd also be happy to make another code change to switch them all to _read_json_data() if that would be more appropriate.

purarue commented 1 month ago

ok, checked back on my device tag and it looks to be around something like

-1798586493

so that probably just was me anonymizing the data and not considering the integer size since python has no bit-specific integers

you can update the test to something more reasonable @mighabana

mighabana commented 1 month ago

Reverted back the remaining issues with the dictionary access. Thanks again for the review!

purarue commented 1 month ago

Alright great, will take a look at it again later today and merge

karlicoss commented 4 weeks ago

Thanks! @seanbreckenridge could you cut off a release please when you got a minute? :)

purarue commented 4 weeks ago

Yeah sure, will do one in a couple hours

purarue commented 4 weeks ago

@karlicoss published v0.1.12