As soon as crusty-core fully supports custom HTML processing, I'd like to experiment a bit and find a faster way to extract links (and probably some metadata) from HTML.
We don't need anything complex for broad web crawling, so it should be possible to speed this up a lot (right now we do full DOM parsing).
Extracting links/title/meta should be easy to do with a simple tokenizer, like the one in html5ever: https://docs.rs/html5ever/0.25.1/html5ever/tokenizer/index.html
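
To make the idea concrete, here is a rough, std-only sketch of pulling `href` values out of `<a>` start tags in a single pass, without building a DOM. It is not the html5ever tokenizer API and not anything crusty-core does today; the names `extract_links`, `find_tag_end` and `attr_value` are made up for illustration. It is also deliberately not spec-compliant: it assumes reasonably well-formed markup, does not treat `<script>`/`<style>` contents as raw text, does not decode entities, and does not resolve relative URLs.

```rust
/// Returns the byte offset of the `>` that closes a tag, skipping any `>`
/// that appears inside a quoted attribute value.
fn find_tag_end(s: &str) -> Option<usize> {
    let mut quote: Option<char> = None;
    for (i, c) in s.char_indices() {
        if let Some(q) = quote {
            if c == q {
                quote = None;
            }
        } else if c == '"' || c == '\'' {
            quote = Some(c);
        } else if c == '>' {
            return Some(i);
        }
    }
    None
}

/// Scans the attribute part of a start tag and returns the value of the
/// first attribute named `wanted` (ASCII case-insensitive).
fn attr_value<'a>(mut s: &'a str, wanted: &str) -> Option<&'a str> {
    loop {
        s = s.trim_start_matches(|c: char| c.is_ascii_whitespace() || c == '/');
        if s.is_empty() {
            return None;
        }
        let name_end = s
            .find(|c: char| c.is_ascii_whitespace() || c == '=' || c == '/')
            .unwrap_or(s.len());
        let name = &s[..name_end];
        s = s[name_end..].trim_start();
        let mut value = "";
        if let Some(after_eq) = s.strip_prefix('=') {
            let after_eq = after_eq.trim_start();
            match after_eq.chars().next() {
                // Quoted value: runs up to the matching quote.
                Some(q @ ('"' | '\'')) => {
                    let body = &after_eq[1..];
                    let close = body.find(q).unwrap_or(body.len());
                    value = &body[..close];
                    s = body.get(close + 1..).unwrap_or("");
                }
                // Unquoted value: runs up to the next whitespace.
                _ => {
                    let close = after_eq
                        .find(|c: char| c.is_ascii_whitespace())
                        .unwrap_or(after_eq.len());
                    value = &after_eq[..close];
                    s = &after_eq[close..];
                }
            }
        }
        if name.eq_ignore_ascii_case(wanted) {
            return Some(value);
        }
    }
}

/// Hypothetical helper: collects non-empty `href` values from `<a>` start
/// tags. Comments are skipped; everything else is treated as tags or text.
pub fn extract_links(html: &str) -> Vec<String> {
    let mut links = Vec::new();
    let mut rest = html;
    while let Some(lt) = rest.find('<') {
        rest = &rest[lt + 1..];
        if let Some(after) = rest.strip_prefix("!--") {
            match after.find("-->") {
                Some(end) => {
                    rest = &after[end + 3..];
                    continue;
                }
                None => break,
            }
        }
        let end = match find_tag_end(rest) {
            Some(e) => e,
            None => break, // unterminated tag: give up on the remainder
        };
        let tag = &rest[..end];
        rest = &rest[end + 1..];
        let name_end = tag
            .find(|c: char| c.is_ascii_whitespace() || c == '/')
            .unwrap_or(tag.len());
        let (name, attrs) = tag.split_at(name_end);
        if name.eq_ignore_ascii_case("a") {
            if let Some(href) = attr_value(attrs, "href") {
                if !href.is_empty() {
                    links.push(href.to_string());
                }
            }
        }
    }
    links
}
```

Calling `extract_links(body)` on a fetched page yields the raw `href` strings in document order. A real tokenizer such as html5ever's would handle the edge cases this sketch ignores (raw-text elements, entities, malformed quoting), so this is only meant to show how little state link extraction needs compared to a full DOM.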