microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License

Allow Formatted Email Addresses #150

Closed by teschste-reyrey 1 year ago

teschste-reyrey commented 1 year ago

I am not using bluemonday for a web site; I am just using it as an HTML tag stripper, via StrictPolicy(), to generate searchable text without interference from the display tags. However, I have come across one anomaly: when a user enters a formatted email address in their text, such as "John Smith <JohnSmith@abc.com>", the <JohnSmith@abc.com> part is removed from the resulting text because it appears to be some type of tag.

I have tried various regex patterns with the AllowElementsMatching modifier but I have not been able to come up with a way to allow an email address in that format to remain in the result text.

Any help on how to get around this would be appreciated!

buro9 commented 1 year ago

Ah... interesting.

So you're getting John Smith <JohnSmith@abc.com> as input, and it's seeing the < and > around the email as an HTML tag.

In essence, this is the problem of using an HTML-aware sanitizer on non-HTML.

I would not be trying to solve this through this library, but would instead try to look at another way to preserve this.

I don't know your input... but have you considered treating the input as Markdown prior to sanitization? In Markdown an email is written as <email@domain.com> and will be rendered as an HTML anchor with the email inside, <a href="mailto:email@domain.com">email@domain.com</a>, so when the strict policy is applied it would preserve the text inside the anchor. For that you could look at running https://github.com/russross/blackfriday before bluemonday.

teschste-reyrey commented 1 year ago

Actually, my input is HTML from an HTML text editor. However, I also let the user search the text they entered, and I need to ensure the search ignores the HTML tags or the results get weird. I will look into blackfriday. Thanks for the quick response!

teschste-reyrey commented 1 year ago

I played with blackfriday and was originally hopeful, because when I passed it just the string that had the email address in it, such as "Send email to John Smith <JohnSmith@abc.com>.", it worked perfectly. However, when I used the actual HTML from the text editor, for example "<p>Send email to John Smith <JohnSmith@abc.com>.</p>", blackfriday appeared to ignore the text entirely and simply returned the exact same string, so it seems that solution will not work for my scenario.

teschste-reyrey commented 1 year ago

I was able to find a solution for my scenario and thought I would share it in case anyone else has the issue. Basically, before I process the text with bluemonday, I replace any <JohnSmith@abc.com> with the same text minus the < and > (i.e. JohnSmith@abc.com). The code I am using is as follows:

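  // Assumes imports of fmt, html, regexp, and github.com/microcosm-cc/bluemonday,
  // and that text holds the HTML from the editor.
  var r = regexp.MustCompile(`(?i)<\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b>`)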
  var ranges [][]int
  var temp string

  ranges = r.FindAllIndex([]byte(text), -1)

  if len(ranges) == 0 {
    // No formatted email address exists, process the text as-is.
    temp = text
  } else {
    // Start with the text that precedes the first formatted email address
    // (an empty string when the address is at the very beginning).
    temp = text[0:ranges[0][0]]

    // Loop through all occurrences of formatted email addresses in the text.
    for idx := 0; idx < len(ranges); idx++ {
      // Add the formatted email address to the temp string, dropping the < and >.
      temp += text[ranges[idx][0]+1:ranges[idx][1]-1]

      if idx < (len(ranges) - 1) {
        // there is at least one more range.
        if ranges[idx][1] < ranges[idx + 1][0] {
          // Grab any text between the current and next occurrence.
          temp += text[ranges[idx][1]:ranges[idx + 1][0]]
        }
      }
    }

    if ranges[len(ranges)-1][1] < len(text) {
      // The formatted email address is not at the end of the text, so grab the rest of the text
      // after the final occurrence.
      temp += text[ranges[len(ranges)-1][1]:]
    }
  }

  // Strip all remaining HTML, reversing any characters bluemonday escaped (to provide clean
  // searchable text).
  fmt.Println(html.UnescapeString(bluemonday.StrictPolicy().Sanitize(temp)))

This issue can be closed.
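
For anyone landing here later, the index-walking loop above can probably be collapsed into a single regexp replacement. The following is only a sketch using the standard library: it assumes the same pattern as above with one capture group added around the address, and the names angleEmail and unwrapEmails are illustrative, not part of bluemonday. The returned string would then go through bluemonday.StrictPolicy().Sanitize and html.UnescapeString exactly as in the snippet above.

```go
package main

import (
	"fmt"
	"regexp"
)

// angleEmail matches an email address wrapped in angle brackets and captures
// the address itself. It is the pattern from the comment above with a capture
// group added (an assumption of this sketch, not a bluemonday feature).
var angleEmail = regexp.MustCompile(`(?i)<(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b)>`)

// unwrapEmails (hypothetical helper) replaces every <address> with the bare
// address, leaving all other text, including real HTML tags, untouched.
func unwrapEmails(text string) string {
	return angleEmail.ReplaceAllString(text, "$1")
}

func main() {
	in := "<p>Send email to John Smith <JohnSmith@abc.com>.</p>"
	// The <p> tags are still present here; stripping them is left to the
	// sanitizer step that follows in the original snippet.
	fmt.Println(unwrapEmails(in))
}
```

Because the replacement handles every match in one pass, the special cases for an address at the start or end of the text, or for adjacent addresses, no longer need separate branches.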