Closed EdwardDiehl closed 7 years ago
I'll look at it tonight if you can provide a sample HTML that is failing. You should take a look at the sax_html method though.
@ohler55 here is the sample:
<img alt=\"Logo.gif?ixlib=rails-2.1\" src=\"https://we-work-remotely-production.imgix.net/logos/0001/5826/logo.gif?ixlib=rails-2.1.3&w=190&min-h=150\">\n\n<p>\n <strong>Headquarters:</strong> New Haven, Connecticut\n <br><strong>URL:</strong> <a href=\"http://aloe.ai\" target=\"_blank\">http://aloe.ai</a>\n</p>\n\n<div><b>Attn: Innovative Software Engineers</b></div>\n<div><br></div>\n<div>We invented a software assistant that improves productivity by helping professionals gain perfect recall and execute more crisply. We are looking for software engineers to help turn this ambitious vision into a great product.</div>\n<div><br></div>\n<div><b>Responsibilities</b></div>\n<div><br></div>\n<ul>\n<li>Work as a core member of our founding team on the invention and refinement of Aloe<br>\n</li>\n<li>Build new features</li>\n<li>Contribute best-in-class programming skills to develop a highly innovative, consumer-quality product for professionals and businesses<br>\n</li>\n</ul>\n<div><b><br></b></div>\n<div><b>Minimum Qualifications</b></div>\n<div><br></div>\n<ul>\n<li>B.S. or M.S. Computer Science or 4+ years in relevant work experience</li>\n<li>3+ years of object-oriented software development experience</li>\n<li>2+ years building mobile apps</li>\n<li>Experience in understanding code bases, including API design techniques to help keep them clean and maintainable</li>\n<li>Experience in the following technologies: ReactJS, React Native, Ruby</li>\n<li>Experience with relational and object databases </li>\n<li>Knowledge of UI design principles and Agile SDLC</li>\n</ul>\n\n<p><strong>To apply:</strong> Please use the \"Join Team\" button on our website:\n\n<a href=\"http://aloe.ai\" target=\"_blank\">http://aloe.ai</a></p>
No errors when I ran it. Did you have the \ in the file or is that an artifact of printing?
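require 'ox'
require 'stringio'

# AllSax is not defined in this snippet; any Ox::Sax-style handler can be substituted.
handler = AllSax.new
options = {
  symbolize: true,
  skip: :skip_white,
  smart: true
}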
input = StringIO.new(%|<img alt="Logo.gif?ixlib=rails-2.1" src="https://we-work-remotely-production.imgix.net/logos/0001/5826/logo.gif?ixlib=rails-2.1.3&w=190&min-h=150">\n\n<p>\n <strong>Headquarters:</strong> New Haven, Connecticut\n <br><strong>URL:</strong> <a href="http://aloe.ai" target="_blank">http://aloe.ai</a>\n</p>\n\n<div><b>Attn: Innovative Software Engineers</b></div>\n<div><br></div>\n<div>We invented a software assistant that improves productivity by helping professionals gain perfect recall and execute more crisply. We are looking for software engineers to help turn this ambitious vision into a great product.</div>\n<div><br></div>\n<div><b>Responsibilities</b></div>\n<div><br></div>\n<ul>\n<li>Work as a core member of our founding team on the invention and refinement of Aloe<br>\n</li>\n<li>Build new features</li>\n<li>Contribute best-in-class programming skills to develop a highly innovative, consumer-quality product for professionals and businesses<br>\n</li>\n</ul>\n<div><b><br></b></div>\n<div><b>Minimum Qualifications</b></div>\n<div><br></div>\n<ul>\n<li>B.S. or M.S. Computer Science or 4+ years in relevant work experience</li>\n<li>3+ years of object-oriented software development experience</li>\n<li>2+ years building mobile apps</li>\n<li>Experience in understanding code bases, including API design techniques to help keep them clean and maintainable</li>\n<li>Experience in the following technologies: ReactJS, React Native, Ruby</li>\n<li>Experience with relational and object databases </li>\n<li>Knowledge of UI design principles and Agile SDLC</li>\n</ul>\n\n<p><strong>To apply:</strong> Please use the "Join Team" button on our website:\n\n<a href="http://aloe.ai" target="_blank">http://aloe.ai</a></p>|)
Ox.sax_html(handler, input, options)
> Did you have the \ in the file or is that an artifact of printing?
Yes, I had it in the string.
@ohler55 thank you very much for the example, it works for me as well.
So for parsing HTML I should use Ox.sax_html with a SAX handler, rather than Ox.parse?
You can use the regular sax parser but you would need to set up the hint or overlay. The sax_html does that for you.
Hi! I tried to parse HTML with the settings specified in the documentation http://www.ohler.com/ox/#label-HTML+Parsing-3A but it returns an error.
A simple example:
Could you provide an example of how to parse HTML? My initial goal is to extract plain text from HTML. Maybe you can give a working example here or on SO, or point me to the documentation.
http://stackoverflow.com/questions/43375214/how-to-extract-plain-text-from-html-markup-in-ruby-with-help-of-ox-gem