servo / html5ever

High-performance browser-grade HTML5 parser

Using html5ever in wasm package for an isomorphic html sanitizer #497

Open dejang opened 1 year ago

dejang commented 1 year ago

Hello,

I am looking at ways to build an HTML5 sanitizer capable of running in browser, NodeJS, and Java environments, with Java being the lowest priority at the moment. The most important requirement is that it must not rely on a DOM, so that it can operate in all of these environments. I stumbled upon html5ever, and it looks like the perfect tool for my scenario, with the added benefit that it's part of the Servo project.

For browser and NodeJS environments I would have to produce WASM artifacts. In NodeJS this keeps distribution across multiple platforms simple, and it also covers environments where I may not be able to load native binary NodeJS plugins. For browsers and mobile WebViews there is no option other than a WASM artifact. These are the constraints on the distribution process, and I am fine with them.

I am using Rust to build the sanitizer, which keeps things easy to manage because the whole development process stays in one programming language.

Currently, when compiling html5ever to WASM I get an output of 450kb, even after running it through wasm-opt with very aggressive size optimizations. Unfortunately, that is far too big a file for the Web. Ideally it would be around 50kb, which would make html5ever a much more desirable alternative to existing JavaScript sanitizers for the browser.

I would like to ask if there is a way to either compile html5ever to WASM so that I can reach my target size or, alternatively, use only the parser features I currently need, in the hope that this shaves off a considerable amount of code. My main scenario is the following: given a string containing HTML, produce a DOM tree that can be traversed to identify the tags, attributes, and attribute values that should be eliminated, then return the result as a string.
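To make the scenario concrete, here is a rough sketch of the traverse-and-filter step only. Everything in it is hypothetical: the `Node` type, the `sanitize` function, and the allowlists are stand-ins for illustration, not html5ever APIs, and a real version would walk whatever tree the parser produces. It also ignores details such as void elements.

```rust
use std::collections::HashSet;

// Hypothetical minimal tree; it stands in for the parser's output.
enum Node {
    Element {
        name: String,
        attrs: Vec<(String, String)>,
        children: Vec<Node>,
    },
    Text(String),
}

// Escape text content and attribute values for serialization.
fn escape(s: &str) -> String {
    s.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

// Serialize `node` into `out`, dropping disallowed elements together with
// their subtrees, and dropping disallowed attributes on kept elements.
fn sanitize(node: &Node, tags: &HashSet<&str>, attrs: &HashSet<&str>, out: &mut String) {
    match node {
        Node::Text(text) => out.push_str(&escape(text)),
        Node::Element { name, attrs: node_attrs, children } => {
            if !tags.contains(name.as_str()) {
                return; // element not on the allowlist: skip it entirely
            }
            out.push('<');
            out.push_str(name);
            for (key, value) in node_attrs {
                if attrs.contains(key.as_str()) {
                    out.push_str(&format!(" {}=\"{}\"", key, escape(value)));
                }
            }
            out.push('>');
            for child in children {
                sanitize(child, tags, attrs, out);
            }
            out.push_str(&format!("</{}>", name));
        }
    }
}
```

An allowlist is used here rather than a blocklist so that unknown tags and attributes are dropped by default; that is a design assumption of this sketch, not a requirement stated above.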

Thank you for taking the time to read this issue. Hopefully, with your help, I'll be able to use html5ever to achieve my goals.

jdm commented 1 year ago

I have no experience with attempting to minimize wasm builds, so I can't provide any assistance there. Html5ever is designed to follow a specific parsing algorithm that is web-compatible, and I'm unaware of any optional features that can be disabled as a result.

tetsuharuohzeki commented 1 year ago

@dejang

From my limited experience minimizing wasm build size, at least as of Rust v1.69, enabling the lto option reduces the size more aggressively than post-processing with wasm-opt. Either full LTO or ThinLTO is fine.
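As a rough sketch, enabling LTO in a size-oriented release profile could look like the `Cargo.toml` fragment below. Only the `lto` setting comes from this thread; `opt-level`, `codegen-units`, and `strip` are standard Cargo profile keys added here as assumptions, and their effect on this particular build has not been measured.

```toml
[profile.release]
lto = true          # full ("fat") LTO; use lto = "thin" for ThinLTO
opt-level = "z"     # assumption: optimize for size rather than speed
codegen-units = 1   # assumption: one codegen unit gives the optimizer a whole-crate view
strip = true        # assumption: strip symbols and debug info from the output
```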

I hope you try it out :)

dejang commented 1 year ago

@tetsuharuohzeki I was using nightly with LTO enabled for this one. It seems 450kb is the best I could do after a bit of fiddling with the optimization settings.