gjtorikian / html-pipeline

HTML processing filters and utilities
MIT License

Migrate from Nokogiri to Selma #368

Closed gjtorikian closed 1 year ago

gjtorikian commented 1 year ago

This completes one of the blocking items for a v3 release: migrating to Selma. Selma parses HTML using lol-html, a Rust library, as its underlying mechanism. It operates purely on CSS selectors and has a few advantages over the current Nokogiri-based mechanism. Namely, in the current system, text is parsed multiple times: each NodeFilter runs its set of CSS selectors synchronously. lol-html, on the other hand, parses the text once, compiles its CSS selectors, and allows for more efficient text manipulation.
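To make the difference concrete, here is a toy sketch in plain Ruby. It is not Selma's or Nokogiri's API: `Token`, `HANDLERS`, `multi_pass`, and `single_pass` are hypothetical names, tokens stand in for parsed nodes, and tag names stand in for CSS selectors. It only illustrates why walking the document once, with all handlers registered up front, touches fewer nodes than letting every filter walk the document on its own.

```ruby
# Toy model only: tokens stand in for parsed HTML nodes, and "selectors" are
# plain tag names. None of this is Selma's or Nokogiri's real API.
Token = Struct.new(:tag, :text)

# Hypothetical handlers: each pairs a tag-name "selector" with a transform.
HANDLERS = {
  "em"   => ->(t) { Token.new("em", t.text.upcase) },
  "code" => ->(t) { Token.new("code", "`#{t.text}`") }
}.freeze

# Multi-pass style: every handler walks the whole document on its own.
def multi_pass(tokens)
  visits = 0
  result = HANDLERS.reduce(tokens) do |doc, (tag, fn)|
    doc.map do |t|
      visits += 1
      t.tag == tag ? fn.call(t) : t
    end
  end
  [result, visits]
end

# Single-pass style: walk once, dispatching each token to its matching handler.
def single_pass(tokens)
  visits = 0
  result = tokens.map do |t|
    visits += 1
    fn = HANDLERS[t.tag]
    fn ? fn.call(t) : t
  end
  [result, visits]
end

DOC = [
  Token.new("p", "hi"),
  Token.new("em", "loud"),
  Token.new("code", "x = 1")
].freeze

multi_result, multi_visits = multi_pass(DOC)
single_result, single_visits = single_pass(DOC)
```

With two handlers and three tokens, the multi-pass version visits six tokens and the single-pass version visits three, while both produce the same output. The gap grows with each additional filter, which is the cost the description attributes to running every NodeFilter separately.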