sparklemotion / nokogiri

Nokogiri (鋸) makes it easy and painless to work with XML and HTML from Ruby.
https://nokogiri.org/
MIT License

html sax parser not fixing not closing attributes #1286

Closed kostya closed 9 years ago

kostya commented 9 years ago

Nokogiri version: 1.6.6.2

require 'bundler/setup'
require 'nokogiri'

class Doc < Nokogiri::XML::SAX::Document
  def characters(chars)
    p chars
  end
end

str = "<html>
<meta name = 'bla
<body>
body
</body>
</html>
"

parser = Nokogiri::HTML::SAX::Parser.new(Doc.new)
parser.parse_memory(str)

The output is empty.

But if I change the meta tag to this (adding the closing '):

<meta name = 'bla'

Nokogiri fixes the broken HTML tag, and the output is "\nbody\n".

I think Nokogiri should also fix the broken HTML tag in the first example.

flavorjones commented 9 years ago

Hi,

Thanks for reporting this.

Can you provide the output from nokogiri -v so that we know what your environment looks like?

-m

(Sent by email in reply to kostya's report of May 10, 2015, 11:56 AM.)

kostya commented 9 years ago

# Nokogiri (1.6.6.2)
    ---
    warnings: []
    nokogiri: 1.6.6.2
    ruby:
      version: 2.2.0
      platform: x86_64-darwin13
      description: ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-darwin13]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxml2/2.9.2"
      libxslt_path: "/Users/kostya/.rbenv/versions/2.2.0/lib/ruby/gems/2.2.0/gems/nokogiri-1.6.6.2/ports/x86_64-apple-darwin13.0.0/libxslt/1.1.28"
      libxml2_patches:
      - 0001-Revert-Missing-initialization-for-the-catalog-module.patch
      - 0002-Fix-missing-entities-after-CVE-2014-3660-fix.patch
      libxslt_patches:
      - 0001-Adding-doc-update-related-to-1.1.28.patch
      - 0002-Fix-a-couple-of-places-where-f-printf-parameters-wer.patch
      - 0003-Initialize-pseudo-random-number-generator-with-curre.patch
      - 0004-EXSLT-function-str-replace-is-broken-as-is.patch
      - 0006-Fix-str-padding-to-work-with-UTF-8-strings.patch
      - 0007-Separate-function-for-predicate-matching-in-patterns.patch
      - 0008-Fix-direct-pattern-matching.patch
      - 0009-Fix-certain-patterns-with-predicates.patch
      - 0010-Fix-handling-of-UTF-8-strings-in-EXSLT-crypto-module.patch
      - 0013-Memory-leak-in-xsltCompileIdKeyPattern-error-path.patch
      - 0014-Fix-for-bug-436589.patch
      - 0015-Fix-mkdir-for-mingw.patch
      compiled: 2.9.2
      loaded: 2.9.2

flavorjones commented 9 years ago

Hi @kostya,

Apologies for not responding sooner. In the first example, libxml2 treats everything after the ' character as an unclosed string (the attribute value), not as an unclosed tag, so the rest of the document is swallowed into that value.
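To illustrate the point, here is a minimal pure-Ruby sketch of how a simple tokenizer might read a quoted attribute value. This is not libxml2's actual code; the helper names `scan_quoted_value` and `attribute_value_after` are made up for this example, and the only library used is Ruby's standard `strscan`.

```ruby
require 'strscan'

# Hypothetical helper (not part of Nokogiri or libxml2): read a quoted
# attribute value, starting at the opening quote character.
def scan_quoted_value(scanner)
  quote = scanner.getch # the opening ' or "
  value = scanner.scan_until(/#{Regexp.escape(quote)}/)
  if value
    # A matching quote was found: the value ends there.
    { value: value.chomp(quote), closed: true }
  else
    # No matching quote: everything up to end of input becomes the value.
    rest = scanner.rest
    scanner.terminate
    { value: rest, closed: false }
  end
end

# Hypothetical helper: jump to just after `marker` (for example "name =")
# and read the quoted value that follows it.
def attribute_value_after(input, marker)
  scanner = StringScanner.new(input)
  scanner.skip_until(/#{Regexp.escape(marker)}\s*/)
  scan_quoted_value(scanner)
end

broken = "<html>\n<meta name = 'bla\n<body>\nbody\n</body>\n</html>\n"
fixed  = broken.sub("'bla", "'bla'")

broken_result = attribute_value_after(broken, "name =")
fixed_result  = attribute_value_after(fixed, "name =")

puts broken_result[:closed]       # false
puts broken_result[:value].length # the value runs to the end of the input
puts fixed_result[:closed]        # true
puts fixed_result[:value]         # bla
```

With the closing quote missing, nothing marks where the value stops, so `<body>`, the text, and the closing tags all end up inside the attribute value and no `characters` event is ever produced for them.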

How Nokogiri corrects broken markup is determined by its underlying libraries (libxml2 for MRI or Xerces for JRuby), and unfortunately there's nothing we can do without drastically invasive changes.

Sorry we can't help you in this situation.
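If you need to catch this kind of input on your side before it reaches the parser, a rough pre-check along the following lines might help. This is only a heuristic sketch in plain Ruby, not something Nokogiri provides: the method name `unterminated_attribute_quote?` is invented here, and it ignores comments, `<script>` contents, and stray `<` characters in text, so expect false positives on real-world pages.

```ruby
# Hypothetical heuristic (not a Nokogiri API): report whether the input ends
# while still inside a quoted attribute value.
def unterminated_attribute_quote?(html)
  state = :text
  quote = nil

  html.each_char do |ch|
    case state
    when :text
      state = :tag if ch == '<'
    when :tag
      if ch == '>'
        state = :text
      elsif ch == "'" || ch == '"'
        quote = ch
        state = :quoted
      end
    when :quoted
      # Only the same quote character that opened the value can close it.
      state = :tag if ch == quote
    end
  end

  state == :quoted
end

broken = "<html>\n<meta name = 'bla\n<body>\nbody\n</body>\n</html>\n"
fixed  = broken.sub("'bla", "'bla'")

puts unterminated_attribute_quote?(broken) # true
puts unterminated_attribute_quote?(fixed)  # false
```

A caller could use a check like this to log or reject suspicious documents up front, instead of getting an empty result from the SAX handler with no explanation.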