lestrrat-go / libxml2

Interface to libxml2, with DOM interface
MIT License
230 stars, 55 forks

Can't deal with Chinese character #56

Closed: surpassly closed this issue 3 years ago

surpassly commented 5 years ago

Which kind of encoder and decoder should I use?

lestrrat commented 5 years ago

Please provide a failing test case (but theoretically it should work)

FanHuaRan commented 3 years ago

Please provide a failing test case (but theoretically it should work)

code:

doc, err := libxml2.ParseHTMLString("可以呢</p>")
if err != nil {
  panic(err)
}
fmt.Println(doc)

output:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

å¯ä»¥å¢

lestrrat commented 3 years ago

This passes in my environment. I can't reproduce it.

func TestGHIssue56(t *testing.T) {
  doc, err := ParseHTMLString("可以呢</p>")
  if !assert.NoError(t, err, `ParseHTMLString should work`) {
    return
  }
  _ = doc
}

FanHuaRan commented 3 years ago

Oh, maybe. My environment:

huaran@zhihudeMacBook-Pro-2 Desktop % pkg-config --modversion libxml-2.0
2.9.10
huaran@zhihudeMacBook-Pro-2 Desktop % sw_vers
ProductName: Mac OS X
ProductVersion: 10.15.5
BuildVersion: 19F101
huaran@zhihudeMacBook-Pro-2 Desktop % gcc --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 11.0.3 (clang-1103.0.32.62)
Target: x86_64-apple-darwin19.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
huaran@zhihudeMacBook-Pro-2 Desktop % go version
go version go1.14.2 darwin/amd64

FanHuaRan commented 3 years ago

I fixed the issue by rewriting the code.

lestrrat commented 3 years ago

Oh you know what, I think I finally understand what you meant with your code in https://github.com/lestrrat-go/libxml2/issues/56#issuecomment-725844995

Because you had a panic() there, I incorrectly assumed you were saying your code produces a panic, but I think what you meant to say was that the encoding for the result is wrong? (This sort of mix-up is EXACTLY why I ask for a failing test case and not a code snippet like you did)

Of course you get mangled output, because the input you have given is an HTML fragment, and therefore is not valid HTML with the metadata the parser needs to recognize what encoding the document is written in.
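To illustrate the kind of mangling described above, here is a minimal sketch that needs no libxml2 at all. It assumes the parser fell back to a single-byte encoding such as ISO-8859-1 when no charset was declared; that fallback is an assumption made for illustration, and `misreadAsLatin1` is a hypothetical helper name, not part of any library.

```go
package main

import "fmt"

// misreadAsLatin1 is a hypothetical helper for illustration only.
// It treats every byte of s as one ISO-8859-1 character, which is what
// happens when UTF-8 input is decoded with a single-byte encoding.
func misreadAsLatin1(s string) string {
	raw := []byte(s) // the UTF-8 bytes of the input
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b) // ISO-8859-1 maps byte N to code point U+00NN
	}
	return string(runes)
}

func main() {
	// Each Chinese character is three bytes in UTF-8, so three characters
	// turn into nine single-byte characters. %q makes the two invisible
	// control characters visible as escapes.
	fmt.Printf("%q\n", misreadAsLatin1("可以呢"))
}
```

Dropping the two invisible control characters (U+008F and U+0091) from the result leaves the same visible text as the output reported earlier in this thread, which is consistent with the missing-charset explanation.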

The easiest thing to do, actually, is to just prepend a proper preamble to your broken HTML:

func TestGHIssue56(t *testing.T) {
  const preamble = `<html><meta http-equiv="Content-Type" content="text/html; charset=utf-8">`
  const brokenInput = `可以呢</p>`
  doc, err := ParseHTMLString(preamble + brokenInput)
  if !assert.NoError(t, err, `ParseHTMLString should work`) {
    return
  }
  t.Logf("%s", doc.String())
}

and voila, it works.
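The preamble trick can be wrapped in a small helper that runs before the parser is called. The following is only a sketch using the Go standard library: `wrapFragment` and `utf8Preamble` are hypothetical names, the `charset=` substring check is a deliberately crude assumption about what counts as "already declared", and the preamble uses the standard `http-equiv`/`content` form of the meta declaration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// utf8Preamble declares the encoding so an HTML parser does not have to guess.
// (Hypothetical constant; adjust it to whatever wrapper your documents need.)
const utf8Preamble = `<html><meta http-equiv="Content-Type" content="text/html; charset=utf-8">`

// wrapFragment is a hypothetical helper, not part of the libxml2 bindings.
// It rejects input that is not valid UTF-8 (the preamble would otherwise lie
// about the encoding) and prepends the preamble unless the input already
// appears to declare a charset.
func wrapFragment(fragment string) (string, error) {
	if !utf8.ValidString(fragment) {
		return "", errors.New("fragment is not valid UTF-8")
	}
	if strings.Contains(strings.ToLower(fragment), "charset=") {
		return fragment, nil // assume the document declares its own encoding
	}
	return utf8Preamble + fragment, nil
}

func main() {
	html, err := wrapFragment("可以呢</p>")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The wrapped string is what would then be passed to ParseHTMLString.
	fmt.Println(html)
}
```

Keeping the fix-up in its own function leaves the parsing call unchanged and makes the encoding assumption explicit in one place.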

In general though, two things:

  1. DO NOT expect libxml2 to be able to parse HTML/XML fragments.
  2. It's your job to properly fix up the input before feeding it to libxml2.

FanHuaRan commented 3 years ago

Great! Now I know the reason. Thank you very much!