soulcutter / saxerator

A SAX-based XML parser for parsing large files into manageable chunks
MIT License

Question about parsing HTML elements #1

Closed javiervegas closed 12 years ago

javiervegas commented 12 years ago

Hi, Bradley!

Great job on this. I was doing some similar XML parsing myself at work and decided to try your gem. It works great; there is just one minor thing that is confusing me. Let's say I have a collection of summary elements that I am traversing and processing. Each summary is an HTML element, like this real one:

<summary id=1071025>
                <p>Even the most jaded visitor may thrill in the Chinese's famous forecourt, where generations of screen legends have left their imprints in cement: feet, hands, dreadlocks (Whoopi Goldberg), and even magic wands (the young stars of the <em>Harry Potter</em> films). Actors dressed as Superman, Marilyn Monroe and the like pose for photos (for tips), and you may be offered free tickets to TV shows.</p>
                <p>The theater is on the <strong>Hollywood Walk of Fame</strong>, which honors over 2000 celebrities with stars embedded in the sidewalk. Other historic theaters include the flashy <a href="/pois/1130703/lang/en" class="poi inline"><name>El Capitan Theater</name></a> and the 1922 <a href="/pois/379895/lang/en" class="poi inline"><name>Egyptian Theater</name></a>, home to American Cinematheque, which presents arty retros and Q&amp;As with directors, writers and actors.</p>
              </summary>

Let's suppose I want to update a database with the HTML content of each summary. When I iterate through my document with @doc.for_tag(:summary).each do |summary|, each summary is a Saxerator::HashWithAttributes holding the tree structure of the parsed HTML.

That was the setup; the question is: is there an easy way to get back the HTML (something like summary.inner_html in Nokogiri), or to tell Saxerator not to parse what is inside summary and treat it as a string?

Other than that, Saxerator did an excellent job parsing some gnarly XML files!

Thanks again,

Javier

soulcutter commented 12 years ago

Thanks for submitting this. I think this is a use case where Saxerator is not yet fully baked. I'm still actively developing this gem, so I'll very likely use your example as a test case.

My current thinking is that I may move away from yielding a HashWithAttributes instance in favor of a class which extends a Nokogiri node class with one additional method, to_hash. That way you will be able to treat the document fragment as an xml document in its own right if you want to do something like you suggest. I have not fully thought this through yet, but it's something I'm kicking around.

To answer your question on whether or not there is an easy way to get back the HTML in your case... the answer is: not right now. I hope to support this very soon; within a week I should have more to share on this.

If you have any control over the XML itself, maybe you could put it into a CDATA block...
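To illustrate the CDATA idea, here is a minimal sketch. It uses REXML rather than Saxerator, purely to show how an XML parser treats a CDATA section; the `summary_strings` helper and the sample document are made up for this example, and how Saxerator itself surfaces CDATA is not shown here.

```ruby
require "rexml/document"

# Hypothetical helper: returns the text content of every <summary> element.
# Markup inside a CDATA section is not parsed, so it comes back as a plain string.
def summary_strings(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//summary").map(&:text)
end

# Sample document (an assumption): the HTML is wrapped in CDATA by the producer.
xml = '<pois><summary id="1071025"><![CDATA[<p>Harry <em>Potter</em> Q&As</p>]]></summary></pois>'

summary_strings(xml).each { |html| puts html }
```

Inside CDATA the HTML does not need to be escaped or even well-formed, which is what makes this the least fragile option when the source document can be changed.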

soulcutter commented 12 years ago

This week has been incredibly busy, so I haven't gotten back around to this. I have given it some thought, though, and I think my comment at the end about the CDATA block is a good point: any HTML document fragments within an XML document really must be inside a CDATA block.

I know that you may not have much control over the source document. The solution I'm thinking of for Saxerator is to write an accumulator that uses Nokogiri's document/node classes, which would give you a better representation for that use case. However, this is a bit on the back burner right now, since the main goal of this library is XML-to-hash conversion.
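As a rough illustration of the accumulator idea, the sketch below collects the inner markup of each matching element as a string while the document streams past. It is not Saxerator code and it swaps in REXML's stream listener for Nokogiri's node classes so that it stays self-contained; `InnerMarkupListener` and `inner_markup` are hypothetical names, and the input is assumed to be well-formed XML (so attribute values must be quoted).

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener (not part of Saxerator): rebuilds the inner markup of
# every element named `tag_name` as a string instead of building a hash.
class InnerMarkupListener
  include REXML::StreamListener

  attr_reader :fragments

  def initialize(tag_name)
    @tag_name = tag_name.to_s
    @fragments = []
    @buffer = nil # non-nil only while inside a target element
    @nested = 0   # same-named elements currently open inside the target
  end

  def tag_start(name, attrs)
    if @buffer.nil?
      @buffer = +"" if name == @tag_name
      return
    end
    @nested += 1 if name == @tag_name
    attr_text = attrs.map { |key, value| %( #{key}="#{escape(value).gsub('"', "&quot;")}") }.join
    @buffer << "<#{name}#{attr_text}>"
  end

  def tag_end(name)
    return if @buffer.nil?
    if name == @tag_name && @nested.zero?
      @fragments << @buffer.strip
      @buffer = nil
    else
      @nested -= 1 if name == @tag_name
      @buffer << "</#{name}>"
    end
  end

  # REXML hands the listener unescaped text, so re-escape it on the way out.
  def text(data)
    @buffer << escape(data) unless @buffer.nil?
  end

  def cdata(content)
    @buffer << escape(content) unless @buffer.nil?
  end

  private

  def escape(string)
    string.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
  end
end

# Hypothetical helper: streams `xml` and returns one markup string per match.
def inner_markup(xml, tag_name)
  listener = InnerMarkupListener.new(tag_name)
  REXML::Document.parse_stream(xml, listener)
  listener.fragments
end
```

Calling `inner_markup(xml, :summary)` would return one string per summary, ready to store in a database. The output is re-serialized rather than copied byte for byte, so empty elements come back as an open/close pair and the original whitespace around the fragment is trimmed.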

I'd be willing to accept a pull request (with tests please!) if you wanted to take a crack at it.