JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character

sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.

https://nokogiri.org/

MIT License

6.14k stars 897 forks source link

JRuby: UtfHelpper.writeCharToUtf8 cannot handle unicode supplementary character #2410

Open AlexSun1995 opened 2 years ago

AlexSun1995 commented 2 years ago

https://github.com/sparklemotion/nokogiri/blob/55029bfba481338825c99e78af2b182b1cc49e04/ext/java/nokogiri/internals/c14n/UtfHelpper.java#L51

since the Canonicalizer process input String character by character. Java uses 16 bits to represent a character; when the input string contains Unicode characters whose code pen are larger than 0Xffff（65535) it will be split into two char, since neither char will not be \ recognized, the Unicode characters will be transferred to 2 ??(3f) instead.

for example, if I want to canonicalize an input that contains 𡏅 via c14n, in the output, 𡏅 will be replaced with ??

flavorjones commented 2 years ago

@AlexSun1995 Thanks for opening this issue, I'll try to help.

Can you please take a moment to provide Ruby code that reproduces the issue you're seeing? Please see https://raw.githubusercontent.com/sparklemotion/nokogiri/main/.github/ISSUE_TEMPLATE/3-bug-report.md for guidance on how to do this.

flavorjones commented 2 years ago

@AlexSun1995 Please help me reproduce what you're seeing?

AlexSun1995 commented 2 years ago

sorry for late, notice such codes below:

require 'nokogiri'

doc = Nokogiri.XML('<foo><bar />𡏅</foo>', nil, 'EUC-JP')
puts doc.canonicalize()

the output would be:

running on jruby 9.3.2.0, I guess it's some java dependencies's bug

flavorjones commented 2 years ago

On CRuby 3.1.0 I see the correct output:

<foo><bar></bar>陝</foo>

On JRuby 9.3.2.0 I see:

<foo><bar></bar>??</foo>

So, yes, this appears to be an issue with the underlying JRuby implementation. I will take a look.

flavorjones commented 2 years ago

Looks like recent JDK implementations have moved to use writeCodepointToUtf8, we can probably do that.

flavorjones commented 2 years ago

I'd like to remove the c14n code from Nokogiri and instead use the upstream JDK jars for this. That will be much easier to do once #1253 and #1967 get merged.

flavorjones commented 2 years ago

Good news: by moving to org.apache.santuario:xmlsec:2.3.0, the encoding test you provided passes on JRuby.

Unfortunately, the previous c14n code was a patched version to work within the original c14n API design borrowed from libxml2, and so the block parameter passed to Document#canonicalize is broken and I'll need some time to work around it in order to implement Node#canonicalize for both JRuby and CRuby.

flavorjones commented 2 years ago

I've got a branch that I want to push to get more eyeballs on the solution which updates to santuario xmlsec but pulls in multiple dependencies and breaks the block parameter to Document#canonicalize, but fixes some of the encoding problems. Won't make it into v1.14.0 but will be a candidate for v1.15.0 if we can work through the changes.

Will push the branch tomorrow as it's based on the branch in https://github.com/sparklemotion/nokogiri/pull/2546 and I want that to go green and get merged first.

flavorjones commented 2 years ago

See https://github.com/sparklemotion/nokogiri/pull/2547 for the PR