seanbreckenridge / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)
https://pypi.org/project/google-takeout-parser/
MIT License

use streaming JSON parser (ijson) #42

Open karlicoss opened 1 year ago

karlicoss commented 1 year ago

I guess it's not a super big deal since we use caching, but it does give a significant (almost 2x) speedup

Had good success using it for a couple of DALs https://github.com/karlicoss/exporthelpers/blob/804b8afa070d8017ad15710a2a179e71ea60316f/dal_helper.py#L140-L171 (made it an optional dependency for backwards compatibility since ijson involves some binaries which might be unavailable for some platforms)

related: https://github.com/seanbreckenridge/google_takeout_parser/issues/40

seanbreckenridge commented 1 year ago

Ah yeah, totally down for adding this, falling back to default behaviour if it fails
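
A minimal sketch of the optional-dependency fallback discussed above, not the project's actual implementation. The function name `iter_json_items` is hypothetical, and the sketch assumes the input file holds a top-level JSON array. It streams with `ijson.items` when ijson can be imported and otherwise falls back to the standard `json` module:

```python
import json
from pathlib import Path
from typing import Any, Iterator

try:
    import ijson  # optional: ships compiled backends that may be missing on some platforms
except ImportError:
    ijson = None


def iter_json_items(path: Path) -> Iterator[Any]:
    """Yield each element of a top-level JSON array (hypothetical helper)."""
    if ijson is not None:
        # Streaming path: elements are parsed and yielded one at a time.
        # Note: by default ijson returns non-integer numbers as Decimal.
        with path.open("rb") as f:
            yield from ijson.items(f, "item")
    else:
        # Fallback path: load the whole document into memory, then iterate.
        with path.open("r", encoding="utf-8") as f:
            yield from json.load(f)
```

Callers iterate the same way in both cases, e.g. `for event in iter_json_items(Path("activity.json")): ...` (the file name here is only an illustration).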

seanbreckenridge commented 1 year ago

have been thinking more about this with me adding more formats to browserexport, will probably create a meta-package like you have in exporthelpers that this will have as a dependency

karlicoss commented 1 year ago

Another relevant thing that may be worth extracting from HPI is a library for accessing compressed stuff https://github.com/karlicoss/kompress/issues/10 Unfortunately, even after a few years, I don't think anything like that exists yet

karlicoss commented 1 year ago

started extracting kompress stuff here btw https://github.com/karlicoss/kompress -- will add more docs, think about whether it needs any refactoring, and then move HPI and bleanser to use it

seanbreckenridge commented 1 year ago

looks good

I think the only thing it doesn't cover for my use case is plain .gz files (not .tar.gz files)

like here: https://github.com/seanbreckenridge/browserexport/blob/734bc46e9200cc888d8146c31d55e7caa039c4e2/browserexport/parse.py#L73

gzip has the same rb -> rt problem lzma does

will PR that, would be nice to be able to use that in my tools instead of re-implementing it everywhere
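
For illustration, a small sketch of the mode issue as I read it: with `gzip.open` (and `lzma.open`), mode `"r"` means binary, whereas the builtin `open` treats `"r"` as text, so text mode has to be requested explicitly with `"rt"`. The helper name `open_text` is hypothetical and is not taken from kompress or browserexport:

```python
import gzip
from pathlib import Path
from typing import IO


def open_text(path: Path) -> IO[str]:
    """Open a plain or gzip-compressed file for reading text (hypothetical helper)."""
    if path.suffix == ".gz":
        # gzip.open(path, "r") would return bytes; "rt" wraps it in a text layer.
        # This sketch does not unpack .tar.gz archives, only single-file .gz.
        return gzip.open(path, "rt", encoding="utf-8")
    return path.open("r", encoding="utf-8")
```

With this, `open_text(Path("history.json.gz")).read()` returns a `str` rather than `bytes` (the file name is only an example).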