mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.55k stars 200 forks source link

Time taken to sanitize html #326

Closed pragatibaheti closed 2 years ago

pragatibaheti commented 2 years ago

I am looking into the details of the library for sanitization and worked on sample inputs. I profiled some time summary that if we pass a complete HTML document close to 4000bytes it takes an avg of 1sec to sanitize (This has some CSS and HTML code containing div tags).

Using the same instance if we sanitize a HTML string (not complete HTML document) it takes 1sec for lesser size (~1000bytes also containing simple tags like a,u,b,i). Any reason for this behavior or what can lead to faster results as time plays a critical role in my use-case?

Would highly appreciate some suggestions.

mganss commented 2 years ago

One second sounds like there is an issue somewhere. The unit tests have a test case that runs all the other tests (177 currently) in parallel on 16 threads and repeats this 1000 times (a brute force approach to discover multithreading issues). This test takes a little more than 2s on my desktop machine.

Can you attach a sample for each of the cases you have mentioned (document vs. fragment)?

pragatibaheti commented 2 years ago

Sure I am attaching a code snippet having a fragment. This takes close to 0.8sec. samplehtml.txt

Also, sanitizing a small fragment <div onclick="alert('xss')">hello</div><img src="javascript:void(0)"/> takes 0.5 sec. Here is another sample https://dotnetfiddle.net/EVBg58 Can you please confirm.

mganss commented 2 years ago

These are mainly startup costs. Here's another fiddle demonstrating this (approx. 0.1ms per run): https://dotnetfiddle.net/yIut58

Try creating the HtmlSanitizer object as a singleton within your application, then call Sanitize() multiple times on this object, possibly from multiple threads as well.

pragatibaheti commented 2 years ago

Ok got it. But if my use-case demands calling sanitize once, then this time can be expected right? Also does this start time gets significantly change on the basis of configs we pass?

Is there any difference in time we can expect if we call sanitize/sanitizedom/sanitizedocument?

mganss commented 2 years ago

Ok got it. But if my use-case demands calling sanitize once, then this time can be expected right?

Yes. I believe the cost is mostly .NET infrastructure cost (JIT, assembly loading etc). For example, running the following just once takes 0.2s on my machine, too:

var json = @"{""key"": ""val""}";
var p = JsonConvert.DeserializeObject<Product>(json);

The average time when running this 10k times in a loop is just 0.05ms, though.

Also does this start time gets significantly change on the basis of configs we pass?

I don't think so.

Is there any difference in time we can expect if we call sanitize/sanitizedom/sanitizedocument?

I don't think so, either.

pragatibaheti commented 2 years ago

Thanks for the clarification