ohler55 / ox

Ruby Optimized XML Parser
http://www.ohler.com/ox
MIT License

Ox::ParseError: invalid format on parsing html #169

Closed EdwardDiehl closed 7 years ago

EdwardDiehl commented 7 years ago

Hi! I tried to parse HTML with the settings specified in the documentation (http://www.ohler.com/ox/#label-HTML+Parsing-3A), but it returns an error.

A simple example:

irb(main):001:0> Ox.default_options = { mode: :generic, effort: :tolerant, smart: true }
=> {:mode=>:generic, :effort=>:tolerant, :smart=>true}
irb(main):002:0> Ox.parse('<img src="logo.png" alt="logo">')
Ox::ParseError: invalid format, document not terminated at line 1, column 33 [parse.c:521]

Could you provide an example of how to parse HTML? My initial goal is to extract plain text from HTML. Maybe you can give a working example here or on SO, or point me to the documentation.

http://stackoverflow.com/questions/43375214/how-to-extract-plain-text-from-html-markup-in-ruby-with-help-of-ox-gem

ohler55 commented 7 years ago

I'll look at it tonight if you can provide a sample HTML that is failing. You should take a look at the sax_html method though.

EdwardDiehl commented 7 years ago

@ohler55 here is the sample:

<img alt=\"Logo.gif?ixlib=rails-2.1\" src=\"https://we-work-remotely-production.imgix.net/logos/0001/5826/logo.gif?ixlib=rails-2.1.3&amp;w=190&amp;min-h=150\">\n\n<p>\n  <strong>Headquarters:</strong> New Haven, Connecticut\n    <br><strong>URL:</strong> <a href=\"http://aloe.ai\" target=\"_blank\">http://aloe.ai</a>\n</p>\n\n<div><b>Attn: Innovative Software Engineers</b></div>\n<div><br></div>\n<div>We invented a software assistant that improves productivity by helping professionals gain perfect recall and execute more crisply. We are looking for software engineers to help turn this ambitious vision into a great product.</div>\n<div><br></div>\n<div><b>Responsibilities</b></div>\n<div><br></div>\n<ul>\n<li>Work as a core member of our founding team on the invention and refinement of Aloe<br>\n</li>\n<li>Build new features</li>\n<li>Contribute best-in-class programming skills to develop a highly innovative, consumer-quality product for professionals and businesses<br>\n</li>\n</ul>\n<div><b><br></b></div>\n<div><b>Minimum Qualifications</b></div>\n<div><br></div>\n<ul>\n<li>B.S. or M.S. Computer Science or 4+ years in relevant work experience</li>\n<li>3+ years of object-oriented software development experience</li>\n<li>2+ years building mobile apps</li>\n<li>Experience in understanding code bases, including API design techniques to help keep them clean and maintainable</li>\n<li>Experience in the following technologies: ReactJS, React Native, Ruby</li>\n<li>Experience with relational and object databases </li>\n<li>Knowledge of UI design principles and Agile SDLC</li>\n</ul>\n\n<p><strong>To apply:</strong> Please use the \"Join Team\" button on our website:\n\n<a href=\"http://aloe.ai\" target=\"_blank\">http://aloe.ai</a></p>

ohler55 commented 7 years ago

No errors when I ran it. Did you have the \ in the file or is that an artifact of printing?

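    require 'ox'
    require 'stringio'

    # AllSax is not defined in this snippet; it is presumably a handler class
    # from Ox's test helpers. Any Ox::Sax subclass can be substituted.
    handler = AllSax.new()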
    options = {
      :symbolize => true,
      :skip => :skip_white,
      :smart => true
    }
    input = StringIO.new(%|<img alt="Logo.gif?ixlib=rails-2.1" src="https://we-work-remotely-production.imgix.net/logos/0001/5826/logo.gif?ixlib=rails-2.1.3&amp;w=190&amp;min-h=150">\n\n<p>\n  <strong>Headquarters:</strong> New Haven, Connecticut\n    <br><strong>URL:</strong> <a href="http://aloe.ai" target="_blank">http://aloe.ai</a>\n</p>\n\n<div><b>Attn: Innovative Software Engineers</b></div>\n<div><br></div>\n<div>We invented a software assistant that improves productivity by helping professionals gain perfect recall and execute more crisply. We are looking for software engineers to help turn this ambitious vision into a great product.</div>\n<div><br></div>\n<div><b>Responsibilities</b></div>\n<div><br></div>\n<ul>\n<li>Work as a core member of our founding team on the invention and refinement of Aloe<br>\n</li>\n<li>Build new features</li>\n<li>Contribute best-in-class programming skills to develop a highly innovative, consumer-quality product for professionals and businesses<br>\n</li>\n</ul>\n<div><b><br></b></div>\n<div><b>Minimum Qualifications</b></div>\n<div><br></div>\n<ul>\n<li>B.S. or M.S. Computer Science or 4+ years in relevant work experience</li>\n<li>3+ years of object-oriented software development experience</li>\n<li>2+ years building mobile apps</li>\n<li>Experience in understanding code bases, including API design techniques to help keep them clean and maintainable</li>\n<li>Experience in the following technologies: ReactJS, React Native, Ruby</li>\n<li>Experience with relational and object databases </li>\n<li>Knowledge of UI design principles and Agile SDLC</li>\n</ul>\n\n<p><strong>To apply:</strong> Please use the "Join Team" button on our website:\n\n<a href="http://aloe.ai" target="_blank">http://aloe.ai</a></p>|)

    Ox.sax_html(handler, input, options)

EdwardDiehl commented 7 years ago

Did you have the \ in the file or is that an artifact of printing?

Yes, I had it in the string. @ohler55, thank you very much for the example; it works for me as well. So for parsing HTML I should use Ox.sax_html with a SAX handler rather than Ox.parse.

ohler55 commented 7 years ago

You can use the regular SAX parser, but you would need to set up the hints or overlay yourself. sax_html does that for you.