microformats / microformats-ruby

Ruby gem that parses HTML containing microformats/microformats2 and returns Ruby objects, a Ruby hash or a JSON hash
https://rubygems.org/gems/microformats
Creative Commons Zero v1.0 Universal

Properly handle nested objects #31

Closed aaronpk closed 7 years ago

aaronpk commented 10 years ago

It seems the parser is not handling nested objects properly.

For example, this URL: http://aaronparecki.com/notes/2014/07/04/2/indiewebcamp-latergram

It appears the comment authors and comment URLs all show up under the main h-entry when in reality they should be under children of the main h-entry as their own h-cite objects.

Compare the result of the PHP parser

mmitchellg5 commented 10 years ago

I believe this is related to this section of the microformats wiki:

http://microformats.org/wiki/microformats-2 Quote:

FOR PARSERS ONLY:

Without a property class name like 'p-org' holding all the nested objects together, we need to introduce another array for nested children (similar to the existing DOM element notion of children) of a microformat that are not attached to a specific property:

Parsed JSON:

{
  "items": [{
    "type": ["h-card"],
    "properties": {
      "name": ["Mitchell Baker"],
      "url": ["http://blog.lizardwrangler.com/"]
    },
    "children": [{
      "type": ["h-card", "h-org"],
      "properties": {
        "name": ["Mozilla Foundation"],
        "url": ["http://mozilla.org/"]
      }
    }]
  }]
}

Since there's no property class name on the element with classes 'h-card' and 'h-org', the microformat representing that element is collected into the children array.

Such a nested microformat implies some relationship (containment, being related), but is not as useful as if the nested microformat was a specific property of its parent.

For this reason it's recommended that authors should not publish nested microformats without a property class name, and instead, when nesting microformats, authors should always specify a property class name (like 'p-org') on the same element as the root class name(s) of the nested microformat(s) (like 'h-card' and/or 'h-org').

This does not appear to be implemented yet.
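
To make the expected shape concrete, here is a minimal Ruby sketch that walks the parsed JSON from the wiki quote above and visits each microformat, descending into the "children" array. It only uses the standard library; `each_item` is a hypothetical helper written for illustration and is not part of this gem's API.

```ruby
require "json"

# Parsed output copied from the wiki example quoted above.
parsed = JSON.parse(<<~JSON)
  {
    "items": [{
      "type": ["h-card"],
      "properties": {
        "name": ["Mitchell Baker"],
        "url": ["http://blog.lizardwrangler.com/"]
      },
      "children": [{
        "type": ["h-card", "h-org"],
        "properties": {
          "name": ["Mozilla Foundation"],
          "url": ["http://mozilla.org/"]
        }
      }]
    }]
  }
JSON

# Hypothetical helper (not from the gem): yields every item with its nesting
# depth, recursing into "children" when the key is present.
def each_item(items, depth = 0, &block)
  items.each do |item|
    block.call(item, depth)
    each_item(item.fetch("children", []), depth + 1, &block)
  end
end

each_item(parsed["items"]) do |item, depth|
  puts "#{'  ' * depth}#{item['type'].join(' ')}: #{item.dig('properties', 'name', 0)}"
end
```

With nested objects handled this way, the Mozilla Foundation h-card shows up one level below the Mitchell Baker h-card instead of being merged into it.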

aaronpk commented 9 years ago

Any updates? I just had to do an ugly workaround for this, dropping down to use the to_hash version of the parsed data: https://github.com/aaronpk/webmention.io/commit/d2cc83613c2571cada747cd26010839a0841b7e5#diff-411ca3c70351e774091d525fab8264b9R330
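
For anyone needing a similar stopgap, a rough sketch of the hash-based approach follows. It assumes the hash mirrors the microformats2 JSON structure, where a nested microformat attached to a property appears as an object with "type", "properties" and "value" keys. The sample data and the `nested_cites` helper are hypothetical and are not taken from the linked commit or from real parser output.

```ruby
# Hypothetical sample shaped like microformats2 JSON output (not real parser output).
entry = {
  "type" => ["h-entry"],
  "properties" => {
    "name" => ["Example note"],
    "comment" => [
      {
        "type" => ["h-cite"],
        "properties" => {
          "author" => ["Example Commenter"],
          "url" => ["http://example.com/reply/1"]
        },
        "value" => "http://example.com/reply/1"
      }
    ]
  }
}

# Hypothetical helper: returns the nested h-cite objects stored under the
# given property, skipping plain string values.
def nested_cites(item, property)
  Array(item.dig("properties", property)).select do |value|
    value.is_a?(Hash) && Array(value["type"]).include?("h-cite")
  end
end

nested_cites(entry, "comment").each do |cite|
  puts cite.dig("properties", "url", 0)
end
```

The key names (strings versus symbols) returned by to_hash may differ from this sketch, so check the actual output before relying on it.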

mmitchellg5 commented 9 years ago

Not as of yet, unfortunately


jeena commented 9 years ago

I tried to parse http://tantek.com/ and it crashed the parser on something like this:

<div class="h-entry">
  <p class="u-comment h-cite">
    test .
  </p>
</div>

with:

URI::InvalidURIError: bad URI(is not URI?): test .

It would be nice if the parser at least wouldn't crash.
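
One way to avoid the crash, sketched with the standard library only: rescue the error and fall back to the raw text when a u-* value is not a valid URI. `safe_url` is a hypothetical helper, not the gem's code, and returning the original text on failure is an assumption about the desired behaviour.

```ruby
require "uri"

# Hypothetical helper (not part of the gem): returns the normalised URI string,
# or the original text unchanged when it cannot be parsed as a URI.
def safe_url(value)
  URI.parse(value.strip).to_s
rescue URI::InvalidURIError
  value
end

puts safe_url("http://tantek.com/")
puts safe_url("test .")
```

Here the second call prints the text as-is rather than raising URI::InvalidURIError.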

veganstraightedge commented 9 years ago

@jeena It seems like the parser is pretty much abandoned by the G5 folks. We got paid to build it (when I was still there) as a part of a larger product. But if it does as much as G5 needs and they're otherwise too busy, it's not likely to get the attention it deserves.

If you're able and willing to submit a pull request with a patch, I could apply it for you.

Unfortunately, (like too many open source projects) this doesn't really have a maintainer anymore. 😕

jeena commented 9 years ago

Oh ok, that's sad, but understandable. Perhaps one could mention that somewhere in the README so people know, and perhaps someone will be able to take it over. I'm not sure I will be able to fix something like this, but if I can, I will make a pull request.

veganstraightedge commented 7 years ago

@jeena @aaronpk I believe this is fixed in 3.0. I just did a test and it looks good to me. Please upgrade and run your comparison too. Re-open this issue if necessary.