untitaker / html5gum

A WHATWG-compliant HTML5 tokenizer and tag soup parser
MIT License

I couldn't figure out how to nicely convert a streamable `reqwest::Response` into a `Readable` #48

asottile closed this issue 1 year ago

asottile commented 1 year ago

this is likely because I am bad at Rust, but I struggled to get this working (though in theory it shouldn't be too difficult?)

I was hoping to be able to do something like:

let resp = reqwest::get(u).await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

or even with bytes_stream

let resp = reqwest::get(u).await?.bytes_stream();

for token in html5gum::Tokenizer::new(&resp) { ... }

I fell back on:

let resp = reqwest::get(u).await?.text().await?;

for token in html5gum::Tokenizer::new(&resp) { ... }

but I'm pretty sure that's not going to stream the response, and it's going to convert the body back and forth (Bytes -> String -> Vec -> String) unnecessarily

untitaker commented 1 year ago

yup that's a known problem. html5gum ideally would accept:

  1. async streams (or whatever other async support there is, see #3)
  2. iterators of characters, bytes, string chunks, whatever

then additionally, html5gum should ideally return String tokens (not Bytes) if the input was already guaranteed to be valid UTF-8. however, that's a lot of trait magic i would have to do.

i think it's likely that, if #47 #21 ever lands, i'll revamp the I/O setup significantly. a lot of the inflexibility i introduced in html5gum was based around performance improvements i can make if the entire input stream is available as a contiguous block of bytes in memory

i don't think buffering up all input in memory is strictly worse for performance. most html documents should fit into your I/O buffer, and in my experience you save quite a bit of branching when passing a string into the tokenizer vs passing in a File object (even with a massive I/O buffer size)
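To make the buffering approach concrete, here is a minimal sketch using only the standard library. It assumes the body arrives as a sequence of byte chunks (for example from a streamed HTTP response); `buffer_chunks` is a hypothetical helper name, not part of html5gum or reqwest. The chunks are appended to one contiguous buffer and UTF-8 is validated once at the end, so a multi-byte character split across two chunks is still decoded correctly.

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper: collects byte chunks into one contiguous buffer
/// and validates the result as UTF-8 in a single pass at the end.
fn buffer_chunks<I>(chunks: I) -> Result<String, FromUtf8Error>
where
    I: IntoIterator<Item = Vec<u8>>,
{
    let mut buf: Vec<u8> = Vec::new();
    for chunk in chunks {
        // Append each chunk as it arrives; nothing is decoded yet,
        // so chunk boundaries inside a multi-byte character are harmless.
        buf.extend_from_slice(&chunk);
    }
    // String::from_utf8 takes ownership of the Vec and reuses its
    // allocation, so validation does not copy the bytes again.
    String::from_utf8(buf)
}
```

Calling `buffer_chunks(vec![b"<p>caf\xC3".to_vec(), b"\xA9</p>".to_vec()])` yields one `String`, which could then be passed to the tokenizer by reference in the same way as the `text()` fallback earlier in this thread.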

i was curious what you were working on. seems like it's some sort of improvement to pip? I think pip already does not use a fully spec-compliant HTML5 parser. And since you're parsing literally only one webpage from a single party, I suspect the range of HTML "dialects" you have to deal with will be very limited, so a custom parser or quick-xml might work fine (and would probably be quicker too)

asottile commented 1 year ago

thanks for the advice!