Closed kostya closed 2 months ago
Hi,
Thank for reporting this.
Can you provide the output from nokogiri -v
so that we know what your
environment looks like?
-m On May 10, 2015 11:32 AM, "kostya" notifications@github.com wrote:
1.6.6.2
require 'bundler/setup'require 'nokogiri' class Doc < Nokogiri::XML::SAX::Document def characters(chars) p chars endend
handler = Doc.new parser = Nokogiri::HTML::SAX::Parser.new(handler) parser.parse_memory("
= - – & ") do |ctx| ctx.replace_entities = falseend" " "=" " - " "–" " " "&" " "
— Reply to this email directly or view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1284.
# Nokogiri (1.6.6.2)
---
warnings: []
nokogiri: 1.6.6.2
ruby:
version: 2.2.0
platform: x86_64-darwin13
description: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]
engine: ruby
libxml:
binding: extension
source: packaged
libxml2_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxml2/2.9.2"
libxslt_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxslt/1.1.28"
libxml2_patches:
- 0001-Revert-Missing-initialization-for-the-catalog-module.patch
- 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
libxslt_patches:
- 0001-Adding-doc-update-related-to-1.1.28.patch
- 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
- 0003-Initialize-pseudo-random-number-generator-with-curre.patch
- 0004-EXSLT-function-str-replace-is-broken-as-is.patch
- 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
- 0007-Separate-function-for-predicate-matching-in-patterns.patch
- 0008-Fix-direct-pattern-matching.patch
- 0009-Fix-certain-patterns-with-predicates.patch
- 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
- 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
- 0014-Fix-for-bug-436589.patch
- 0015-Fix-mkdir-for-mingw.patch
compiled: 2.9.2
loaded: 2.9.2
Is there any update on this issue?
I also think that this test https://github.com/sparklemotion/nokogiri/blob/027a151102083258f5e297c4fecfddca56904a2a/test/xml/test_entity_reference.rb#L175 is not correct, since the block passed to Nokogiri::XML::SAX::Parser.new
gets ignored. You have to pass it to #parse
in order to get access to Nokogiri::XML::SAX::ParserContext
. As @kostya reported it still has no effect though :-/
# Nokogiri (1.6.7.2)
---
warnings: []
nokogiri: 1.6.7.2
ruby:
version: 2.3.0
platform: x86_64-linux-gnu
description: ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux-gnu]
engine: ruby
libxml:
binding: extension
source: packaged
libxml2_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.2"
libxslt_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.28"
libxml2_patches:
- 0001-Revert-Missing-initialization-for-the-catalog-module.patch
- 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
- 0003-Stop-parsing-on-entities-boundaries-errors.patch
- 0004-Cleanup-conditional-section-error-handling.patch
- 0005-CVE-2015-1819-Enforce-the-reader-to-run-in-constant-.patch
- 0006-Another-variation-of-overflow-in-Conditional-section.patch
- 0007-Fix-an-error-in-previous-Conditional-section-patch.patch
- 0008-CVE-2015-8035-Fix-XZ-compression-support-loop.patch
- 0009-Updated-config.guess.patch
- 0010-Fix-parsering-short-unclosed-comment-uninitialized-access.patch
- 0011-Avoid-extra-processing-of-MarkupDecl-when-EOF.patch
- 0012-Avoid-processing-entities-after-encoding-conversion-.patch
- 0013-CVE-2015-7497-Avoid-an-heap-buffer-overflow-in-xmlDi.patch
- 0014-CVE-2015-5312-Another-entity-expansion-issue.patch
- 0015-Add-xmlHaltParser-to-stop-the-parser.patch
- 0016-Detect-incoherency-on-GROW.patch
- 0017-CVE-2015-7500-Fix-memory-access-error-due-to-incorre.patch
- 0018-CVE-2015-8242-Buffer-overead-with-HTML-parser-in-pus.patch
- 0019-Do-not-print-error-context-when-there-is-none.patch
- 0020-xmlStopParser-reset-errNo.patch
- 0021-Reuse-xmlHaltParser-where-it-makes-sense.patch
libxslt_patches:
- 0001-Adding-doc-update-related-to-1.1.28.patch
- 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
- 0003-Initialize-pseudo-random-number-generator-with-curre.patch
- 0004-EXSLT-function-str-replace-is-broken-as-is.patch
- 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
- 0007-Separate-function-for-predicate-matching-in-patterns.patch
- 0008-Fix-direct-pattern-matching.patch
- 0009-Fix-certain-patterns-with-predicates.patch
- 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
- 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
- 0014-Fix-for-bug-436589.patch
- 0015-Fix-mkdir-for-mingw.patch
- 0016-Fix-for-type-confusion-in-preprocessing-attributes.patch
- 0017-Updated-config.guess.patch
compiled: 2.9.2
loaded: 2.9.2
I do also experience this issue with Nokogiri 1.6.8. Here's how I reproduce in irb -r nokogiri
:
class Parser < Nokogiri::XML::SAX::Document;def characters(v); p v; end;end
Nokogiri::HTML::SAX::Parser.new(Parser.new).parse("’"){|ctx| ctx.replace_entities = false}
# I get "\u0092" rather than "’"
This is causing some issues later in our application. I'd love to get this bug fixed. Right now the work-around I'm using is to .gsub /&(#\d{3};)/, '!!\1'
and then gsub again after processed which fixes my specific case but it is a fragile solution.
If you want a failing test, just replace these lines:
https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L108 https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L126
So that replace_entities
is set to false
and expect 'daddy & me
' rather than 'daddy & me
'.
Currently, setting replace_entities
does absolutely nothing no matter what you set it to.
As part of my work in #1926, I investigated entity handling and documented its behavior in #3265. Here's a screenshot of the docs:
As you can see, character references and predefined entities will always result in a callback to #characters
with the replacement text. This is just how the underlying parsers (libxml2 and xerces) behave.
Sorry it took so long to explain this, and I hope this is helpful.
1.6.6.2