jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Simplified the Entities.escape method #2183

Closed: jhy closed this 3 months ago

jhy commented 3 months ago

Introduced an options bitset in place of all those boolean method parameters. Reduced the method's cyclomatic complexity from 29 to 14.
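To illustrate the bitset idea, here is a minimal sketch, not the actual jsoup code. The class name, flag names, and the simplified escaping rules are all assumptions made for the example; the point is only that one `int` of OR-ed flags replaces a list of boolean parameters.

```java
public final class EscapeOptionsSketch {
    // Hypothetical flag names; each option occupies one bit of an int.
    public static final int FOR_TEXT = 1;             // escape '>' as in text nodes
    public static final int FOR_ATTRIBUTE = 1 << 1;   // escape '"' as in attribute values
    public static final int NORMALISE_WHITE = 1 << 2; // collapse whitespace runs to one space

    // Simplified escape: a single int carries what would otherwise be three booleans.
    public static String escape(String in, int options) {
        boolean forText = (options & FOR_TEXT) != 0;
        boolean forAttribute = (options & FOR_ATTRIBUTE) != 0;
        boolean normalise = (options & NORMALISE_WHITE) != 0;

        StringBuilder out = new StringBuilder(in.length());
        boolean lastWasWhite = false;
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (normalise && Character.isWhitespace(c)) {
                if (!lastWasWhite) out.append(' ');
                lastWasWhite = true;
                continue;
            }
            lastWasWhite = false;
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append(forText ? "&gt;" : ">"); break;
                case '"': out.append(forAttribute ? "&quot;" : "\""); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

A caller combines flags with `|`, for example `escape(s, FOR_TEXT | NORMALISE_WHITE)`, so adding a new option later does not change the method signature.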

Improved the throughput of escape by around 22% when the content contains supplementary-plane characters. Instead of going through the charset encoder to test whether a character can be encoded, that check is now pushed into the CoreCharset. That removes a ByteBuffer allocation on each hit.
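A rough sketch of what "pushing the check into the core charset" can look like. This is an illustration under assumptions rather than the jsoup implementation: the enum constants, the `of` helper, and the method shape are made up for the example. For ASCII and UTF charsets the answer follows from the code point alone, so the `CharsetEncoder` is only consulted for other charsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public final class CanEncodeSketch {
    // Hypothetical classification of the output charset.
    public enum CoreCharset {
        ASCII, UTF, FALLBACK;

        public static CoreCharset of(Charset charset) {
            String name = charset.name();
            if (name.equals("US-ASCII")) return ASCII;
            if (name.startsWith("UTF-")) return UTF;
            return FALLBACK;
        }
    }

    // Decides encodability from the code point where possible;
    // only the fallback branch touches the encoder.
    public static boolean canEncode(CoreCharset core, int codePoint, CharsetEncoder fallback) {
        switch (core) {
            case ASCII:
                return codePoint < 0x80;
            case UTF:
                // Everything except a lone surrogate is encodable in UTF-8/16/32.
                return !(codePoint >= Character.MIN_SURROGATE && codePoint <= Character.MAX_SURROGATE);
            default:
                return fallback.canEncode(new String(Character.toChars(codePoint)));
        }
    }
}
```

With this shape, a supplementary character such as U+1D5DB is rejected for ASCII output by a single integer comparison.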

jhy commented 3 months ago

I perf tested using this wiki list of colors, which has some interesting attributes like <p title="𝗛𝗦𝗩 (34° 14% 98%)&#10;𝗥𝗚𝗕 (250 235 215)&#10;𝗛𝗘𝗫 #FAEBD7">. Those letters are supplementary-plane characters (mathematical sans-serif bold capital H, etc.); 𝗛 becomes &#120283; when the output is encoded as ASCII (in UTF we don't need to encode it).
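For readers unfamiliar with why these characters are special: each one is a single code point above U+FFFF, stored in a Java `String` as a surrogate pair, so the escaper has to walk the string by code point rather than by `char`. A minimal sketch, assuming ASCII output and decimal references as in the comment above (the class and method names are hypothetical, and `&`, `<` and friends are deliberately left alone here):

```java
public final class NumericEscapeSketch {
    // Replaces every non-ASCII code point with a decimal numeric character reference.
    public static String escapeForAscii(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int codePoint = in.codePointAt(i);   // reads a full surrogate pair if present
            i += Character.charCount(codePoint); // advances by 1 or 2 chars
            if (codePoint < 0x80) {
                out.appendCodePoint(codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
        }
        return out.toString();
    }
}
```

Applied to U+1D5DB this yields `&#120283;`; the hexadecimal form `&#x1d5db;` refers to the same character.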

Average html() operations per second went from ~ 156 to ~ 186.
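The harness behind these numbers is not shown in the thread. As a rough idea of how an operations-per-second figure can be taken, here is a JDK-only sketch; the class and method names are hypothetical, and a dedicated tool such as JMH would give more trustworthy results.

```java
import java.util.function.Supplier;

public final class ThroughputSketch {
    // Results are written here so the JIT cannot discard the measured work.
    public static volatile Object sink;

    // Runs the operation for a warm-up period, then measures operations per second.
    public static double opsPerSecond(Supplier<?> op, long warmupMillis, long measureMillis) {
        run(op, warmupMillis); // warm-up result is discarded
        return run(op, measureMillis);
    }

    private static double run(Supplier<?> op, long millis) {
        long start = System.nanoTime();
        long deadline = start + millis * 1_000_000L;
        long count = 0;
        long now;
        do {
            sink = op.get();
            count++;
            now = System.nanoTime();
        } while (now < deadline);
        return count * 1e9 / (now - start);
    }
}
```

In the scenario described here, the supplier would be something like `() -> doc.html()` on the parsed colors page, measured before and after the change.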

jhy commented 3 months ago

And some further improvements in 27e7d5f03d85bea15fbfe5432c37c6de94965a82. Getting around 235 ops/sec (from 156 -- so about 49% faster now!)