lt, gt chars parsing inside a tag

iharsuvorau commented 7 years ago

Is it expected that sanitize.HTML won't parse HTML like <p>1 < 2</p> accurately?

kennygrant commented 7 years ago

What do you mean by accurately? What do you see, and what do you expect to see?

iharsuvorau commented 7 years ago

When sanitizing <p>1 < 2</p>, I get "1 " (with a space after the number), but I want to get 1 < 2. That's because the parser ignores everything after the < character, even if it isn't the start of an HTML tag.

kennygrant commented 7 years ago

I'm not sure that's valid HTML; you're supposed to escape less-than signs (in HTML5 at least). If you try validating this HTML here:

<!DOCTYPE html>
<html>
<head>
<title>asdf</title>
</head>
<body>
<p>1 < 2</p>
</body>
</html>

https://validator.w3.org/#validate_by_input

It will fail. It's hard for the parser to know whether this is the start of a tag or a less-than sign, so the &lt; entity would be better. Browsers may try to work around it, as they're used to broken HTML, but I would try to fix the HTML.
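If the HTML is generated by your own Go code, one way to fix it at the source is to escape the text before wrapping it in a tag. This is only a rough sketch using the standard library's html.EscapeString; the paragraph helper is a made-up name for illustration and is not part of sanitize.

```go
package main

import (
	"fmt"
	"html"
)

// paragraph is a hypothetical helper (not part of sanitize): it escapes
// plain text with html.EscapeString before wrapping it in a <p> element,
// so a bare < in the text becomes the &lt; entity.
func paragraph(text string) string {
	return "<p>" + html.EscapeString(text) + "</p>"
}

func main() {
	fmt.Println(paragraph("1 < 2")) // prints <p>1 &lt; 2</p>
}
```

Markup produced this way should validate, and the parser no longer has to guess whether the < starts a tag.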

You could pre-process it by searching for ' < ' and replacing it with ' &lt; ', but I think I'd rather not do that in sanitize.
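For what it's worth, that pre-processing step might look roughly like the sketch below. The escapeLooseLessThan name is hypothetical and not part of sanitize; it only handles a less-than sign with a space on both sides, on the assumption that such a sign is never the start of a tag.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeLooseLessThan is a hypothetical helper (not part of sanitize).
// It replaces every less-than sign that has a space on both sides with
// the &lt; entity and leaves everything else untouched.
func escapeLooseLessThan(s string) string {
	return strings.ReplaceAll(s, " < ", " &lt; ")
}

func main() {
	in := "<p>1 < 2</p>"
	fmt.Println(escapeLooseLessThan(in)) // prints <p>1 &lt; 2</p>
	// The result could then be passed to sanitize.HTML.
}
```

Note that this is a plain string replacement, so it would also rewrite ' < ' inside script blocks or attribute values, and it won't catch cases like 1<2 without spaces.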

kennygrant / sanitize

lt, gt chars parsing inside a tag #18