AlexSun1995 opened 2 years ago
@AlexSun1995 Thanks for opening this issue, I'll try to help.
Can you please take a moment to provide Ruby code that reproduces the issue you're seeing? Please see https://raw.githubusercontent.com/sparklemotion/nokogiri/main/.github/ISSUE_TEMPLATE/3-bug-report.md for guidance on how to do this.
@AlexSun1995 Can you help me reproduce what you're seeing?
Sorry for the late reply. Please see the code below:
```ruby
require 'nokogiri'

doc = Nokogiri.XML('<foo><bar />𡏅</foo>', nil, 'EUC-JP')
puts doc.canonicalize
```
The output would be:
Running on JRuby 9.3.2.0; I'd guess it's a bug in one of the Java dependencies.
On CRuby 3.1.0 I see the correct output:

```
<foo><bar></bar>陝</foo>
```

On JRuby 9.3.2.0 I see:

```
<foo><bar></bar>??</foo>
```
So, yes, this appears to be an issue with the underlying JRuby implementation. I will take a look.
Looks like recent JDK implementations have moved to use writeCodepointToUtf8; we can probably do the same.
I'd like to remove the c14n code from Nokogiri and instead use the upstream JDK jars for this. That will be much easier to do once #1253 and #1967 get merged.
Good news: by moving to org.apache.santuario:xmlsec:2.3.0, the encoding test you provided passes on JRuby.
Unfortunately, the previous c14n code was a patched version designed to work with the original c14n API borrowed from libxml2, so the block parameter passed to Document#canonicalize is broken; I'll need some time to work around that in order to implement Node#canonicalize for both JRuby and CRuby.
I've got a branch that I want to push to get more eyeballs on the solution. It updates to santuario xmlsec and fixes some of the encoding problems, but pulls in multiple dependencies and breaks the block parameter to Document#canonicalize. It won't make it into v1.14.0, but it will be a candidate for v1.15.0 if we can work through the changes.
Will push the branch tomorrow as it's based on the branch in https://github.com/sparklemotion/nokogiri/pull/2546 and I want that to go green and get merged first.
See https://github.com/sparklemotion/nokogiri/pull/2547 for the PR
https://github.com/sparklemotion/nokogiri/blob/55029bfba481338825c99e78af2b182b1cc49e04/ext/java/nokogiri/internals/c14n/UtfHelpper.java#L51
since the Canonicalizer processes the input String character by character. Java uses 16 bits to represent a char; when the input string contains a Unicode character whose code point is larger than 0xFFFF (65535), it is split into two chars (a UTF-16 surrogate pair). Since neither char is recognized as a valid character on its own, the Unicode character is written out as two ?? (0x3F) instead.
For example, if I canonicalize an input that contains 𡏅 via c14n, then in the output 𡏅 will be replaced with ??.