Closed (surpassly closed this issue 3 years ago)
Please provide a failing test case (but theoretically it should work)
code:
doc, err := libxml2.ParseHTMLString("可以呢</p>")
if err != nil {
	panic(err)
}
fmt.Println(doc)
output: `<?xml version="1.0" encoding="utf-8" standalone="yes"?> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">å ¯ä»¥å ¢`
This passes in my environment, so I can't reproduce it:
func TestGHIssue56(t *testing.T) {
	doc, err := ParseHTMLString("可以呢</p>")
	if !assert.NoError(t, err, `ParseHTMLString should work`) {
		return
	}
	_ = doc
}
Oh, maybe.
my environment:
huaran@zhihudeMacBook-Pro-2 Desktop % pkg-config --modversion libxml-2.0
2.9.10
huaran@zhihudeMacBook-Pro-2 Desktop % sw_vers
ProductName: Mac OS X
ProductVersion: 10.15.5
BuildVersion: 19F101
huaran@zhihudeMacBook-Pro-2 Desktop % gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
huaran@zhihudeMacBook-Pro-2 Desktop % go version
go version go1.14.2 darwin/amd64
I fixed the issue by rewriting the code.
Oh, you know what, I think I finally understand what you meant with your code in https://github.com/lestrrat-go/libxml2/issues/56#issuecomment-725844995
Because you had a `panic()` there, I incorrectly assumed you were saying that your code produces a panic, but I think what you meant to say was that the encoding of the result is wrong? (This sort of mix-up is EXACTLY why I ask for a failing test case and not a code snippet like the one you posted.)
Of course you get a mangled output: the input you have given is an HTML fragment, not a valid HTML document, so it lacks the metadata the parser needs to recognize which encoding the document is written in.
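To see where the mangled characters come from, here is a minimal sketch. It assumes the parser fell back to a single-byte ISO-8859-1 interpretation when no encoding was declared, which is consistent with the output posted above but is my assumption, not something verified against the libxml2 source. `latin1Decode` is a hypothetical helper written for this illustration and is not part of this package.

```go
package main

import "fmt"

// latin1Decode is a hypothetical helper: it treats every byte of s as one
// ISO-8859-1 code point, which is what a parser does when it assumes a
// single-byte encoding for input that is really UTF-8.
func latin1Decode(s string) string {
	runes := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		// Indexing a Go string yields raw bytes, not decoded runes.
		runes = append(runes, rune(s[i]))
	}
	return string(runes)
}

func main() {
	const input = "可以呢" // 3 characters, 9 bytes in UTF-8

	// Show the raw UTF-8 bytes, then the same bytes misread as Latin-1.
	fmt.Printf("% x\n", []byte(input))
	fmt.Printf("%q\n", latin1Decode(input))
}
```

Each three-byte UTF-8 character turns into three unrelated Latin-1 characters, two of which are unprintable control characters here, which would explain the gaps in the mangled output.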
The easiest thing to do, actually, is to just prepend a proper preamble to your broken HTML:
func TestGHIssue56(t *testing.T) {
	// Declare the charset up front so the parser does not have to guess.
	const preamble = `<html><meta http-equiv="Content-Type" content="text/html; charset=utf-8">`
	const brokenInput = `可以呢</p>`
	doc, err := ParseHTMLString(preamble + brokenInput)
	if !assert.NoError(t, err, `ParseHTMLString should work`) {
		return
	}
	t.Logf("%s", doc.String())
}
and voila, it works.
In general though, two things:
- DO NOT expect libxml2 to be able to parse HTML/XML fragments.
- It's your job to properly fix up the input before feeding it to libxml2.
Great! Now I understand the reason. Thank you very much!
Which kind of encoder and decoder should I use?