bjornri commented 1 year ago

HtmlSanitizer.GetAllNodes() uses the program stack to iterate on elements in the DOM, and when we try to parse a deeply nested email (a few thousand elements deep) we get a stack overflow exception. Because our app runs in IIS it has a very small stack size. However, even if we increase our stack size, we will still crash - we just need a bigger HTML document. And when we crash with stack overflow, the entire app goes down.

See https://github.com/mganss/HtmlSanitizer/issues/417 as this is the same issue as we are having.

Even though HtmlSanitizer throws StackOverflowException, AngleSharp is still able to parse the HTML.

I have rewritten GetAllNodes() to use a stack on the heap to avoid stack overflows, so that if AngleSharp can parse then HtmlSanitizer can sanitize it. I don't think HtmlSanitizer rely on the ordering of the returned enumeration, but I have kept the depth-first ordering we got when using the recursive approach.

Note that there is the same issue with StyleExtensions.GetStylesheets() in AngleSharp that still cause HtmlSanitizer to crash when enumerating style sheets. I plan to send a PR to that repository as well.

I think existing unit tests has enough coverage for this method so I did not create any additional tests.

codecov[bot] commented 1 year ago

Codecov Report

Base: 94.43% // Head: 94.64% // Increases project coverage by +0.20% :tada:

Coverage data is based on head (a0da33f) compared to base (008cff7). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #423 +/- ## ========================================== + Coverage 94.43% 94.64% +0.20% ========================================== Files 6 6 Lines 827 840 +13 Branches 79 83 +4 ========================================== + Hits 781 795 +14 Misses 34 34 + Partials 12 11 -1 ``` | [Impacted Files](https://codecov.io/gh/mganss/HtmlSanitizer/pull/423?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Michael+Ganss) | Coverage Δ | | |---|---|---| | [src/HtmlSanitizer/HtmlSanitizer.cs](https://codecov.io/gh/mganss/HtmlSanitizer/pull/423?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Michael+Ganss#diff-c3JjL0h0bWxTYW5pdGl6ZXIvSHRtbFNhbml0aXplci5jcw==) | `93.41% <100.00%> (+0.48%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Michael+Ganss). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Michael+Ganss)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

bjornri commented 1 year ago

It looks like if we are to reach the codecov/patch target, the parameter on GetAllNodes() must be INode (not nullable) and callers have to check for null before calling it. I added a commit with this change.

mganss commented 1 year ago

Thanks!

mganss / HtmlSanitizer

Rewrite GetAllNodes to use a stack on the heap #423

Codecov Report