Closed: asottile closed this issue 1 year ago
yup that's a known problem. html5gum ideally would accept:
then additionally, html5gum should ideally return String tokens (not Bytes) if the input was already guaranteed to be valid utf-8. however, that requires a lot of trait magic on my end.
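a minimal std-only sketch of what that "trait magic" could look like (this is not html5gum's actual API; all names here are hypothetical): an input-source trait whose associated output type is `String` when the stream was pre-validated as UTF-8, and `Vec<u8>` otherwise, so the tokenizer never re-validates.

```rust
// Hypothetical sketch, not html5gum's real trait setup: the input source
// decides whether emitted token text is String or raw bytes.
trait TokenSource {
    type Output;
    fn emit(raw: Vec<u8>) -> Self::Output;
}

struct Utf8Input;  // input came from a &str, guaranteed valid UTF-8
struct BytesInput; // input came from arbitrary bytes

impl TokenSource for Utf8Input {
    type Output = String;
    fn emit(raw: Vec<u8>) -> String {
        // Safe because the whole stream was validated up front; a real
        // implementation could skip this check entirely.
        String::from_utf8(raw).expect("input was pre-validated")
    }
}

impl TokenSource for BytesInput {
    type Output = Vec<u8>;
    fn emit(raw: Vec<u8>) -> Vec<u8> {
        raw
    }
}

// A tokenizer generic over the source hands back whichever type fits.
fn tokenize_text<S: TokenSource>(raw: &[u8]) -> S::Output {
    S::emit(raw.to_vec())
}

fn main() {
    let s: String = tokenize_text::<Utf8Input>(b"hello");
    let b: Vec<u8> = tokenize_text::<BytesInput>(b"hello");
    println!("{} {:?}", s, b);
}
```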
i think it's likely that, if #47 #21 ever lands, i'll revamp the I/O setup significantly. a lot of the inflexibility i introduced in html5gum was based around performance improvements i can make if the entire input stream is available as a contiguous block of bytes in memory
i don't think buffering up all input in memory is strictly worse for performance. most html documents should fit into your I/O buffer, and in my experience you save quite a bit of branching when passing a string into the tokenizer vs passing in a File object (even with a massive I/O buffer size)
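the buffering approach above can be sketched with std alone: slurp the whole document into one contiguous `String` up front, then hand that to the tokenizer, instead of pulling bytes through a reader as you go (function name is illustrative).

```rust
use std::io::Read;

// Read the entire input into one contiguous String before tokenizing.
// Most HTML documents fit comfortably in memory, and a contiguous &str
// avoids per-read branching inside the tokenizer.
fn read_all(mut input: impl Read) -> std::io::Result<String> {
    let mut buf = String::new();
    input.read_to_string(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Any Read works; a byte slice stands in for a File here.
    let html = read_all(&b"<h1>hello</h1>"[..])?;
    // `html` can now be passed to the tokenizer as one block
    assert_eq!(html, "<h1>hello</h1>");
    Ok(())
}
```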
i was curious what you were working on. seems like it's some sort of improvement to pip? I think pip already does not use a fully spec-compliant HTML5 parser. and since you're parsing literally only one webpage from a single party, i suspect the range of possible HTML "dialects" you have to deal with will be very limited, so a custom parser or quick-xml might work fine (and would probably be quicker too)
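to illustrate the "custom parser" idea: if the one page you fetch is as regular as a simple package index (one `<a href="...">` per entry), a naive scanner can do the job without a spec-compliant HTML5 tokenizer. this is a hedged sketch, not how pip actually parses anything:

```rust
// Illustrative only: extract href attribute values from a very regular,
// single-source page. Breaks on single-quoted or unquoted attributes,
// which is exactly the trade-off of skipping a real HTML5 parser.
fn extract_hrefs(html: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find("href=\"") {
        rest = &rest[pos + 6..]; // skip past `href="`
        if let Some(end) = rest.find('"') {
            out.push(&rest[..end]);
            rest = &rest[end + 1..];
        } else {
            break; // unterminated attribute, stop scanning
        }
    }
    out
}

fn main() {
    let page = r#"<a href="/simple/requests/">requests</a>
<a href="/simple/flask/">flask</a>"#;
    assert_eq!(
        extract_hrefs(page),
        vec!["/simple/requests/", "/simple/flask/"]
    );
}
```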
thanks for the advice!
this is likely because I am bad at rust, but I struggled to get this working (though in theory it shouldn't be too difficult?)
I was hoping to be able to do something like:
or even with bytes_stream
fell back on
but I'm pretty sure that's not going to stream the response and it's going to convert it back and forth from Bytes -> String -> Vec -> String unnecessarily
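the conversion chain described above can be made concrete with std alone (function names here are illustrative, not from the actual code): each `String::from_utf8` call re-validates the same bytes, so validating once and keeping the `String` is strictly cheaper.

```rust
// The wasteful round trip: Bytes -> String -> Vec -> String.
fn wasteful(bytes: Vec<u8>) -> String {
    let s = String::from_utf8(bytes).expect("valid utf-8"); // validates
    let v = s.into_bytes();                                 // free, but pointless
    String::from_utf8(v).expect("valid utf-8")              // validates again
}

// The cheaper path: validate once and keep the String.
fn direct(bytes: Vec<u8>) -> String {
    String::from_utf8(bytes).expect("valid utf-8")
}

fn main() {
    let body = b"<h1>hello</h1>".to_vec();
    assert_eq!(wasteful(body.clone()), direct(body));
}
```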