purarue / google_takeout_parser

A library/CLI tool to parse data out of your Google Takeout (History, Activity, YouTube, Locations, etc.)
https://pypi.org/project/google-takeout-parser/
MIT License

use streaming html parser #40

Closed (purarue closed 1 year ago)

purarue commented 1 year ago

loading the whole HTML document into memory is pretty expensive. could either use a streaming HTML parser, or maybe split the file before loading it?
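
For illustration, here is a minimal sketch of the streaming option using Python's standard-library `html.parser.HTMLParser`, fed in fixed-size chunks so that roughly one chunk plus one record's text is held in memory at a time. The assumptions here are not taken from this repo: each activity record is assumed to be a `<div>` whose class list includes `outer-cell`, and `CellTextParser`, `iter_cells`, and `chunk_size` are hypothetical names. It only extracts the visible text of each record, so treat it as a starting point rather than a full parser.

```python
from html.parser import HTMLParser
from typing import Iterator, List, TextIO


class CellTextParser(HTMLParser):
    """Collects the text of each record <div> as soon as it closes.

    Assumption: a record is a <div> whose class list contains `cell_class`.
    """

    def __init__(self, cell_class: str = "outer-cell") -> None:
        super().__init__(convert_charrefs=True)
        self.cell_class = cell_class
        self.depth = 0  # <div> nesting depth inside the current record; 0 = outside
        self.parts: List[str] = []  # text fragments of the current record
        self.done: List[str] = []  # finished records not yet handed to the caller

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            self.parts.append(" ")  # a tag boundary separates text runs
            if tag == "div":
                self.depth += 1
        elif tag == "div" and self.cell_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        self.parts.append(" ")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                # text may arrive in several pieces, so join first, then
                # collapse runs of whitespace
                self.done.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def iter_cells(fp: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield the text of each record without reading the whole file at once."""
    parser = CellTextParser()
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)  # the parser buffers any tag cut off at the chunk edge
        yield from parser.done
        parser.done.clear()
    parser.close()  # flush whatever is still buffered
    yield from parser.done
    parser.done.clear()
```

Usage would look something like `for text in iter_cells(open(path, encoding="utf-8")): ...`, where `path` points at one of the Takeout HTML files. Because records are yielded as they close, memory use depends on `chunk_size` and the size of one record rather than on the size of the file.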

purarue commented 1 year ago

tried using lxml for this, haven't been able to figure it out yet

https://github.com/seanbreckenridge/google_takeout_parser/commit/09307dafc85cf7b7e1d4572caabd2e53f5bc50aa

purarue commented 1 year ago

If anyone else has libraries they'd recommend here, I'm very open to suggestions. None of my experiments have gone well so far.

purarue commented 1 year ago

ended up just using an HTML tokenizer in Go

this is all legacy anyway, so I don't know if anyone else is ever going to use this. it's more for my own usage