seanbreckenridge / url_cache

A file system cache which saves URL metadata and summarizes content
https://pypi.org/project/url-cache/
Apache License 2.0

clean HTML to reduce size, fix text summary #6

Closed by seanbreckenridge 3 years ago

seanbreckenridge commented 3 years ago

The only reason the HTML is included in the responses is that it contains semantic/structural information that the text summary doesn't. Something like lynx, or any other HTML reader, applies the default meaning of the HTML tags, which gives the page its structure.

Also, the text summary breaks anything that is ASCII-art-like: it strips the whitespace from each line, which destroys a lot of information.
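To illustrate the whitespace problem, here is a minimal sketch. It is not the library's actual summarization code; `strip_each_line` and `rstrip_each_line` are hypothetical helpers that only show why stripping both sides of every line destroys layout, while stripping only trailing whitespace would not.

```python
# Hypothetical helpers for illustration only, not url_cache internals.
ART = "  /\\_/\\\n ( o.o )\n  > ^ <"


def strip_each_line(text: str) -> str:
    """Strip leading and trailing whitespace from every line (lossy)."""
    return "\n".join(line.strip() for line in text.splitlines())


def rstrip_each_line(text: str) -> str:
    """Strip only trailing whitespace, so indentation survives."""
    return "\n".join(line.rstrip() for line in text.splitlines())


if __name__ == "__main__":
    print(strip_each_line(ART))   # indentation lost, the drawing no longer lines up
    print(rstrip_each_line(ART))  # indentation kept
```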

seanbreckenridge commented 3 years ago

Instead of trying to parse the cleaned HTML into text here, that should be handled by a separate library, written either by me or by someone else.

For now, I've replaced the example in the README with one that uses lynx, which is one option.
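As a rough sketch of the lynx approach (not necessarily the same as the README example), the cached HTML could be piped through lynx from Python. This assumes the `lynx` binary is installed and on `PATH`; `html_to_text` is a hypothetical helper name, not part of url_cache.

```python
import subprocess


def html_to_text(html: str) -> str:
    """Render an HTML string to plain text using lynx.

    Assumes lynx is installed. -stdin reads the document from standard
    input, -dump writes the rendered text to stdout, and -nolist omits
    the numbered list of links at the end.
    """
    result = subprocess.run(
        ["lynx", "-stdin", "-dump", "-nolist"],
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Some <b>bold</b> text.</p>"))
```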

seanbreckenridge commented 3 years ago

Yeah, I think it'd be best for this to just provide the interface for caching things, rather than trying to add every feature (HTML minifying and so on). It's better to sweep over the cache directory afterwards with a tool that does that job well than to try to do it all at request time.
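A minimal sketch of what such an after-the-fact sweep might look like. The on-disk layout is an assumption here (any `*.html` file under the cache directory), and `sweep` and `strip_comments` are hypothetical names. The comment-stripping regex is a deliberately naive stand-in for a real HTML cleaner or minifier.

```python
import re
from pathlib import Path
from typing import Callable

# Naive placeholder transform: removes HTML comments only.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html: str) -> str:
    return COMMENT_RE.sub("", html)


def sweep(cache_dir: Path, transform: Callable[[str], str],
          pattern: str = "*.html") -> int:
    """Apply transform to every matching file under cache_dir.

    Rewrites a file only when the transform changed it, and returns
    the number of files rewritten. The file pattern is an assumption
    about how the cache is laid out.
    """
    changed = 0
    for path in cache_dir.rglob(pattern):
        if not path.is_file():
            continue
        original = path.read_text(encoding="utf-8")
        cleaned = transform(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

Because the transform is just a function passed in, a proper minifier or text extractor could be swapped in without touching the caching code.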