sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License
6.14k stars 896 forks source link

sax parser with replace_entities = false still replaces entities #1284

Closed kostya closed 2 months ago

kostya commented 9 years ago

1.6.6.2

require 'bundler/setup'
require 'nokogiri'

class Doc < Nokogiri::XML::SAX::Document
  def characters(chars)
    p chars
  end
end

handler = Doc.new
parser = Nokogiri::HTML::SAX::Parser.new(handler)
parser.parse_memory("<body> &#61 - &#8211; &amp; &nbsp; </body>") do |ctx|
  ctx.replace_entities = false
end
" "
"="
" - "
"–"
" "
"&"
" "
" "
" "
flavorjones commented 9 years ago

Hi,

Thank for reporting this.

Can you provide the output from nokogiri -v so that we know what your environment looks like?

-m On May 10, 2015 11:32 AM, "kostya" notifications@github.com wrote:

1.6.6.2

require 'bundler/setup'require 'nokogiri' class Doc < Nokogiri::XML::SAX::Document def characters(chars) p chars endend

handler = Doc.new parser = Nokogiri::HTML::SAX::Parser.new(handler) parser.parse_memory(" &#61 - – & ") do |ctx| ctx.replace_entities = falseend

" " "=" " - " "–" " " "&" " "

— Reply to this email directly or view it on GitHub https://github.com/sparklemotion/nokogiri/issues/1284.

kostya commented 9 years ago
# Nokogiri (1.6.6.2)
    ---
    warnings: []
    nokogiri: 1.6.6.2
    ruby:
      version: 2.2.0
      platform: x86_64-darwin13
      description: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxml2/2.9.2"
      libxslt_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      compiled: 2.9.2
      loaded: 2.9.2
tisba commented 8 years ago

Is there any update on this issue?

I also think that this test https://github.com/sparklemotion/nokogiri/blob/027a151102083258f5e297c4fecfddca56904a2a/test/xml/test_entity_reference.rb#L175 is not correct, since the block passed to Nokogiri::XML::SAX::Parser.new gets ignored. You have to pass it to #parse in order to get access to Nokogiri::XML::SAX::ParserContext. As @kostya reported it still has no effect though :-/

# Nokogiri (1.6.7.2)
    ---
    warnings: []
    nokogiri: 1.6.7.2
    ruby:
      version: 2.3.0
      platform: x86_64-linux-gnu
      description: ruby 2.3.0p0 (2015-12-25 revision 53290) [x86_64-linux-gnu]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxml2/2.9.2"
      libxslt_path: "/var/fejo-dk/fejo-next/vendor/bundle/ruby/2.3.0/gems/nokogiri-1.6.7.2/ports/x86_64-pc-linux-gnu/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      - 0003-Stop-parsing-on-entities-boundaries-errors.patch
      - 0004-Cleanup-conditional-section-error-handling.patch
      - 0005-CVE-2015-1819-Enforce-the-reader-to-run-in-constant-.patch
      - 0006-Another-variation-of-overflow-in-Conditional-section.patch
      - 0007-Fix-an-error-in-previous-Conditional-section-patch.patch
      - 0008-CVE-2015-8035-Fix-XZ-compression-support-loop.patch
      - 0009-Updated-config.guess.patch
      - 0010-Fix-parsering-short-unclosed-comment-uninitialized-access.patch
      - 0011-Avoid-extra-processing-of-MarkupDecl-when-EOF.patch
      - 0012-Avoid-processing-entities-after-encoding-conversion-.patch
      - 0013-CVE-2015-7497-Avoid-an-heap-buffer-overflow-in-xmlDi.patch
      - 0014-CVE-2015-5312-Another-entity-expansion-issue.patch
      - 0015-Add-xmlHaltParser-to-stop-the-parser.patch
      - 0016-Detect-incoherency-on-GROW.patch
      - 0017-CVE-2015-7500-Fix-memory-access-error-due-to-incorre.patch
      - 0018-CVE-2015-8242-Buffer-overead-with-HTML-parser-in-pus.patch
      - 0019-Do-not-print-error-context-when-there-is-none.patch
      - 0020-xmlStopParser-reset-errNo.patch
      - 0021-Reuse-xmlHaltParser-where-it-makes-sense.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      - 0016-Fix-for-type-confusion-in-preprocessing-attributes.patch
      - 0017-Updated-config.guess.patch
      compiled: 2.9.2
      loaded: 2.9.2
rosenfeld commented 8 years ago

I do also experience this issue with Nokogiri 1.6.8. Here's how I reproduce in irb -r nokogiri:

class Parser < Nokogiri::XML::SAX::Document;def characters(v); p v; end;end
Nokogiri::HTML::SAX::Parser.new(Parser.new).parse("&#146;"){|ctx| ctx.replace_entities = false}
# I get "\u0092" rather than "&#146;"

This is causing some issues later in our application. I'd love to get this bug fixed. Right now the work-around I'm using is to .gsub /&(#\d{3};)/, '!!\1' and then gsub again after processed which fixes my specific case but it is a fragile solution.

rosenfeld commented 8 years ago

If you want a failing test, just replace these lines:

https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L108 https://github.com/sparklemotion/nokogiri/blob/master/test/html/sax/test_parser.rb#L126

So that replace_entities is set to false and expect 'daddy &amp; me' rather than 'daddy & me'.

Currently, setting replace_entities does absolutely nothing no matter what you set it to.

flavorjones commented 2 months ago

As part of my work in #1926, I investigated entity handling and documented its behavior in #3265. Here's a screenshot of the docs:

image

As you can see, character references and predefined entities will always result in a callback to #characters with the replacement text. This is just how the underlying parsers (libxml2 and xerces) behave.

Sorry it took so long to explain this, and I hope this is helpful.