Open mmcdole opened 1 year ago
I'm guessing I need to handle non-UTF-8 feeds by first converting the data to UTF-8 and only then sanitizing it.
I could do something like:
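import (
	"bytes"
	"fmt"
	"io"

	"golang.org/x/net/html/charset" // assumed source of charset.NewReader
)

// convertToUTF8 decodes data to UTF-8. An empty content type leaves
// charset.NewReader to determine the encoding from the data itself.
func convertToUTF8(data []byte) (string, error) {
	reader, err := charset.NewReader(bytes.NewReader(data), "")
	if err != nil {
		return "", err
	}
	utf8Data, err := io.ReadAll(reader)
	if err != nil {
		return "", err
	}
	return string(utf8Data), nil
}

// sanitizeXML converts xmlData to UTF-8 and rewrites every rune that
// isLegalXMLChar rejects as a hexadecimal character reference.
func sanitizeXML(xmlData []byte) (string, error) {
	utf8Data, err := convertToUTF8(xmlData)
	if err != nil {
		utf8Data = string(xmlData) // Fallback to original data if conversion fails
	}
	var buffer bytes.Buffer
	for _, r := range utf8Data {
		if isLegalXMLChar(r) {
			buffer.WriteRune(r)
		} else {
			fmt.Fprintf(&buffer, "&#x%X;", r)
		}
	}
	return buffer.String(), nil
}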
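The snippet calls isLegalXMLChar without showing it. A minimal sketch, assuming the helper is meant to follow the Char production of XML 1.0 (the helper actually used in the library may differ):

```go
package main

// isLegalXMLChar reports whether r matches the Char production of XML 1.0:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
func isLegalXMLChar(r rune) bool {
	return r == 0x09 || r == 0x0A || r == 0x0D ||
		(r >= 0x20 && r <= 0xD7FF) ||
		(r >= 0xE000 && r <= 0xFFFD) ||
		(r >= 0x10000 && r <= 0x10FFFF)
}
```

Note that U+FFFD, the replacement character Go yields when ranging over invalid UTF-8, counts as legal under this definition, so undecoded bytes pass the check as replacement characters rather than being flagged.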
I could call this at the beginning of the sanitize function, but I'm not sure what I'd do if charset.NewReader failed to detect the encoding.
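One possible answer to the detection-failure question is to fall back to a fixed default encoding. The sketch below uses only the standard library; toUTF8WithFallback is a hypothetical name, and the choice of ISO-8859-1 as the default is an assumption that would be wrong for feeds in, say, Shift-JIS.

```go
package main

import "unicode/utf8"

// toUTF8WithFallback (hypothetical helper) returns data unchanged when it is
// already valid UTF-8. Otherwise it assumes ISO-8859-1, whose bytes map
// one-to-one onto the code points U+0000..U+00FF.
func toUTF8WithFallback(data []byte) string {
	if utf8.Valid(data) {
		return string(data)
	}
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return string(runes)
}
```

The upside of a fixed fallback is that no byte is dropped; the downside is that a wrong guess produces mojibake instead of an error, so it may be worth logging when the fallback path is taken.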
Following issues #180, #25, and some others, I'd like to make character sanitization more robust.
I've previously tried to have the code do something like the following:
However, an old issue, #21, indicated that sanitizing these characters broke parsing of non-UTF-8 feeds.
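One plausible mechanism for that breakage (an assumption, since the details of #21 are not reproduced here) is that a rune-by-rune loop over bytes that were never decoded treats every invalid byte as U+FFFD. A small sketch with a hypothetical helper, lostBytes, that counts how many bytes would be affected:

```go
package main

import "unicode/utf8"

// lostBytes (hypothetical helper) counts the bytes in data that a
// rune-by-rune loop over string(data) would turn into U+FFFD because
// they are not valid UTF-8.
func lostBytes(data []byte) int {
	n := 0
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		if r == utf8.RuneError && size == 1 {
			n++ // invalid byte: decoded as the replacement character
		}
		i += size
	}
	return n
}
```

For the ISO-8859-1 bytes of "café" (the last byte is 0xE9), lostBytes reports 1: the accented letter would be replaced unless the data is decoded first.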
If anyone has any suggestions for how to accommodate both requirements (sanitizing illegal characters and still parsing non-UTF-8 feeds), it would be much appreciated!