let4be / crusty

Broad Web Crawler
GNU General Public License v3.0

Implement faster HTML parsing #6

Closed by let4be 3 years ago

let4be commented 3 years ago

Once crusty-core fully supports custom HTML processing, I'd like to experiment a bit and find a faster way to extract links (and probably some metadata) from HTML.

We don't need anything complex for broad web crawling, so it should be possible to speed this up a lot (right now we do full DOM parsing).

Extracting links/title/meta should be easy to do with a simple tokenizer, like the one in https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html
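To illustrate the idea, here is a rough sketch of tokenizer-level extraction that walks the tags once and never builds a DOM. It is not html5ever's tokenizer API and not crusty's code: it is a hand-rolled, std-only scanner, and the names `Extracted`, `attr_value`, and `extract` are hypothetical. It assumes every tag ends at the first `>` and ignores edge cases a real tokenizer handles (a `>` inside attribute values, `<script>` bodies, character references).

```rust
/// Links and title pulled out of a page without building a DOM.
#[derive(Debug, Default, PartialEq)]
struct Extracted {
    title: Option<String>,
    links: Vec<String>,
}

/// Look up attribute `wanted` (ASCII case-insensitive) in the raw text that
/// follows the tag name inside a start tag.
fn attr_value(tag_body: &str, wanted: &str) -> Option<String> {
    let b = tag_body.as_bytes();
    let mut i = 0;
    while i < b.len() {
        // Skip whitespace and the '/' of self-closing tags.
        while i < b.len() && (b[i].is_ascii_whitespace() || b[i] == b'/') {
            i += 1;
        }
        // The attribute name runs until whitespace, '=' or '/'.
        let name_start = i;
        while i < b.len() && !b[i].is_ascii_whitespace() && b[i] != b'=' && b[i] != b'/' {
            i += 1;
        }
        let name = &tag_body[name_start..i];
        while i < b.len() && b[i].is_ascii_whitespace() {
            i += 1;
        }
        // Optional value: quoted (single or double) or bare.
        let mut value = "";
        if i < b.len() && b[i] == b'=' {
            i += 1;
            while i < b.len() && b[i].is_ascii_whitespace() {
                i += 1;
            }
            if i < b.len() && (b[i] == b'"' || b[i] == b'\'') {
                let quote = b[i];
                i += 1;
                let start = i;
                while i < b.len() && b[i] != quote {
                    i += 1;
                }
                value = &tag_body[start..i];
                if i < b.len() {
                    i += 1; // step over the closing quote
                }
            } else {
                let start = i;
                while i < b.len() && !b[i].is_ascii_whitespace() {
                    i += 1;
                }
                value = &tag_body[start..i];
            }
        }
        if !name.is_empty() && name.eq_ignore_ascii_case(wanted) {
            return Some(value.to_string());
        }
    }
    None
}

/// Single pass over the input: collect `href` values of `<a>` tags and the
/// text of the first `<title>`.
fn extract(html: &str) -> Extracted {
    let mut out = Extracted::default();
    let mut rest = html;
    while let Some(lt) = rest.find('<') {
        rest = &rest[lt + 1..];
        // Simplifying assumption: the tag ends at the first '>'.
        let Some(gt) = rest.find('>') else { break };
        let tag = &rest[..gt];
        rest = &rest[gt + 1..];
        // End tags start with '/', so their name comes out empty and is skipped.
        let name_end = tag
            .find(|c: char| c.is_ascii_whitespace() || c == '/')
            .unwrap_or(tag.len());
        let (name, body) = tag.split_at(name_end);
        if name.eq_ignore_ascii_case("a") {
            if let Some(href) = attr_value(body, "href") {
                if !href.is_empty() {
                    out.links.push(href);
                }
            }
        } else if name.eq_ignore_ascii_case("title") && out.title.is_none() {
            // Title text runs up to the next tag.
            let end = rest.find('<').unwrap_or(rest.len());
            out.title = Some(rest[..end].trim().to_string());
        }
    }
    out
}
```

Calling `extract(page)` on a fetched body returns the raw `href` strings and the trimmed title; resolving relative links against the page URL would still be a separate step.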

let4be commented 3 years ago

Ended up using https://github.com/cloudflare/lol-html