mmcdole / gofeed

Parse RSS, Atom and JSON feeds in Go
MIT License
2.56k stars 208 forks

Make illegal character sanitization more robust #206

Open mmcdole opened 1 year ago

mmcdole commented 1 year ago

Following issues #180, #25, and a few others, I'd like to make character sanitization more robust.

I've previously tried to have the code do something like the following:

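import (
    "bytes"
    "fmt"
)

// sanitizeXML rewrites every character outside the XML 1.0 Char range
// as a numeric character reference.
func sanitizeXML(xmlData string) string {
    var buffer bytes.Buffer

    for _, r := range xmlData {
        if isLegalXMLChar(r) {
            buffer.WriteRune(r)
        } else {
            // Replace illegal characters with their XML character reference.
            // You can also skip writing illegal characters by commenting out the next line.
            // Caveat: XML 1.0 forbids references to these code points as well,
            // so a strict parser may still reject the output.
            fmt.Fprintf(&buffer, "&#x%X;", r)
        }
    }

    return buffer.String()
}

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
    return r == 0x9 || r == 0xA || r == 0xD ||
        (r >= 0x20 && r <= 0xD7FF) ||
        (r >= 0xE000 && r <= 0xFFFD) ||
        (r >= 0x10000 && r <= 0x10FFFF)
}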
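
The escaping branch above emits references such as `&#x1;`, but the XML 1.0 well-formedness rules require a character reference to resolve to a legal character, so the escaped form is no more acceptable than the raw one. A minimal sketch of a variant that drops those characters instead, assuming silent removal is acceptable for feed content (`stripIllegalXML` is a hypothetical name, not part of gofeed):

```go
package main

import (
	"fmt"
	"strings"
)

// isLegalXMLChar reports whether r matches the XML 1.0 Char production.
func isLegalXMLChar(r rune) bool {
	return r == 0x9 || r == 0xA || r == 0xD ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}

// stripIllegalXML (hypothetical) removes illegal characters instead of
// escaping them. strings.Map drops a rune when the mapping function
// returns a negative value.
func stripIllegalXML(s string) string {
	return strings.Map(func(r rune) rune {
		if isLegalXMLChar(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	// U+0001 and U+000B are both outside the legal range.
	fmt.Println(stripIllegalXML("<title>a\x01b\x0Bc</title>")) // prints "<title>abc</title>"
}
```

Replacing with U+FFFD instead of dropping would be a one-line change (return `'\uFFFD'` rather than `-1`) if it is useful to keep a visible marker where a character was removed.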

However, an old issue (#21) reported that sanitizing these characters broke parsing of non-UTF-8 feeds.
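
One plausible cause, offered as an assumption rather than something confirmed in #21: `range` over a Go string decodes it as UTF-8, so any byte that is not valid UTF-8 comes back as U+FFFD, which passes `isLegalXMLChar` and is written out as three different bytes. The sketch below isolates that effect with a hypothetical `roundTrip` helper that mimics the sanitizer's loop without any filtering:

```go
package main

import (
	"fmt"
	"strings"
)

// roundTrip (hypothetical) copies s rune by rune, the same way the
// sanitizer loop does, but without filtering anything.
func roundTrip(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	latin1 := "caf\xE9" // "café" encoded as ISO-8859-1; 0xE9 alone is not valid UTF-8
	out := roundTrip(latin1)

	fmt.Printf("% x\n", latin1) // 63 61 66 e9
	fmt.Printf("% x\n", out)    // 63 61 66 ef bf bd
}
```

The original byte is gone after the loop, so a later charset conversion based on the feed's declared encoding has nothing correct left to decode.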

If anyone has suggestions for how to accommodate both requirements (sanitizing illegal characters while still parsing non-UTF-8 feeds correctly), it would be much appreciated!

mmcdole commented 1 year ago

I'm guessing I need to handle this in two steps:

  1. Convert non-UTF-8 feeds to UTF-8
  2. Sanitize the feed afterwards

I could do something like:

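import (
    "bytes"
    "fmt"
    "io"

    "golang.org/x/net/html/charset" // assumed source of charset.NewReader
)

// convertToUTF8 decodes data to UTF-8. With an empty content type,
// charset.NewReader has to guess the encoding from the content itself.
func convertToUTF8(data []byte) (string, error) {
    reader, err := charset.NewReader(bytes.NewReader(data), "")
    if err != nil {
        return "", err
    }
    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16+).
    utf8Data, err := io.ReadAll(reader)
    if err != nil {
        return "", err
    }
    return string(utf8Data), nil
}

// sanitizeXML converts the feed to UTF-8, then escapes illegal characters.
// The returned error is currently always nil, because a failed conversion
// falls back to the raw bytes.
func sanitizeXML(xmlData []byte) (string, error) {
    utf8Data, err := convertToUTF8(xmlData)
    if err != nil {
        utf8Data = string(xmlData) // Fallback to original data if conversion fails
    }

    var buffer bytes.Buffer

    for _, r := range utf8Data {
        if isLegalXMLChar(r) {
            buffer.WriteRune(r)
        } else {
            fmt.Fprintf(&buffer, "&#x%X;", r)
        }
    }

    return buffer.String(), nil
}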

I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if charset.NewReader failed to detect the encoding.
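
For that failure case, one option is a standard-library-only fallback: keep the input as-is when it is already valid UTF-8, and otherwise replace the undecodable bytes with U+FFFD so the sanitizer at least sees well-formed input. This is only a sketch under that assumption; `fallbackToUTF8` is a hypothetical name, and it loses the original characters rather than recovering them.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// fallbackToUTF8 (hypothetical) returns data unchanged when it is valid
// UTF-8. Otherwise each run of invalid bytes is replaced with one U+FFFD.
func fallbackToUTF8(data []byte) string {
	if utf8.Valid(data) {
		return string(data)
	}
	return string(bytes.ToValidUTF8(data, []byte("\uFFFD")))
}

func main() {
	// 0xE9 is "é" in ISO-8859-1 but an invalid byte sequence in UTF-8.
	fmt.Printf("%q\n", fallbackToUTF8([]byte("caf\xE9 ok")))
	fmt.Printf("%q\n", fallbackToUTF8([]byte("héllo")))
}
```

Making the replacement explicit at this point, rather than leaving it to the `range` loop, keeps the lossy step in one place where it could be logged or surfaced as an error instead.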