glennjones / microformat-node

Microformats parser for node.js
http://glennjones.net/tools/microformats/
MIT License

p-name breaks on empty text #22

Closed notenoughneon closed 9 years ago

notenoughneon commented 9 years ago

The example below is not parsing correctly. I would expect the entry "name" to be the empty string. Adding any non-whitespace text to the e-content causes it to revert to expected behavior.

<!DOCTYPE html>
<html lang="en">
<head>
</head>
<body>
    <div class="h-entry">
        <a href="http://this.site/photo" class="u-url"></a>
        <div class="e-content p-name"><img src="photo.jpg" class="u-photo"/></div>

        Some extraneous text

        <div class="h-cite">
            <a href="http://someother.site/like" class="u-url"></a>
            <a href="http://this.site/photo" class="u-like-of"></a>
            <div class="e-content p-name">liked this</div>
        </div>
    </div>
</body>
</html>

Parser output:

{ items: 
   [ { type: [ 'h-entry' ],
       properties: 
        { url: [ 'http://this.site/photo' ],
          content: [ { value: '', html: '<img src="photo.jpg" class="u-photo" />' } ],
          photo: [ 'photo.jpg' ],
          name: [ 'Some extraneous text\r\n\r\n        \r\n            \r\n            \r\n            liked this' ] },
       children: 
        [ { value: 'liked this',
            type: [ 'h-cite' ],
            properties: 
             { url: [ 'http://someother.site/like' ],
               'like-of': [ 'http://this.site/photo' ],
               content: [ { value: 'liked this', html: 'liked this' } ],
               name: [ 'liked this' ] } } ] } ],
  rels: {},
  'rel-urls': {} }

glennjones commented 9 years ago

Hi Emma

The parsing rules I am following here are:

If a property (p-name) is empty, do not add it to the output. Here "empty" means containing no non-whitespace text. As far as I know there is no guidance on how to handle "empty" properties in the microformats parsing rules, so I followed the convention of JSON APIs not to return "empty" properties.

A side effect of this is that the "implied rules" for p-name come into play. The "implied rules" try to fill in properties like p-name automatically if there is no defined value. In your example the parser uses the full text content of the parent h-entry.
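To make the interaction concrete, here is a minimal sketch of the two rules described above: drop whitespace-only values, then fall back to the parent's text. `resolveName` and its arguments are hypothetical names used for illustration; this is not the code in microformat-node.

```javascript
// Hypothetical helper, not part of microformat-node.
// explicitValues: text of each p-name element found inside the h-* element.
// parentText: full text content of the parent h-* element.
function resolveName(explicitValues, parentText) {
  // Rule 1: a value with no non-whitespace text counts as "empty" and is dropped.
  const nonEmpty = explicitValues.filter((value) => value.trim() !== '');
  if (nonEmpty.length > 0) {
    return nonEmpty;
  }
  // Rule 2 (implied name): nothing is left, so use the parent's text content.
  return [parentText.trim()];
}

const parentText = '\n  Some extraneous text\n\n  liked this\n';

// The p-name element exists but holds only an image, so its text is ''.
console.log(resolveName([''], parentText));

// Any non-whitespace text in p-name switches the implied rule off.
console.log(resolveName(['A photo'], parentText));
```

In the first call the empty p-name is discarded and the name is filled from the surrounding text, which matches the output reported in this issue.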

I can see why the difference between the two outputs would seem a little unexpected.

This is not really a bug, but a valid question about how the parsing rules should work.

I am going to have to post this issue to the microformats IRC channel and see if we can define the rules a bit more clearly for your use case. Once we have an agreed approach I can update the parser.

Personally, I would recommend that any author of microformats always add a p-name with some text to every h-*. Also, with my parser, try setting the options to {'textFormat': 'normalised'}; you may find the resulting text more useful.
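As a rough illustration of why normalised text may be easier to work with, the sketch below collapses runs of whitespace in the implied name from the output above. `normaliseText` is a hypothetical function, and the assumption that normalisation means "collapse whitespace and trim" is mine; the parser's actual 'normalised' option may behave differently.

```javascript
// Hypothetical helper: an assumed approximation of whitespace normalisation,
// not the parser's own implementation of the 'normalised' text format.
function normaliseText(text) {
  // Collapse every run of spaces, tabs and line breaks into a single space,
  // then strip leading and trailing whitespace.
  return text.replace(/\s+/g, ' ').trim();
}

// The implied name exactly as it appears in the output reported above.
const rawName = 'Some extraneous text\r\n\r\n        \r\n            \r\n            \r\n            liked this';

console.log(normaliseText(rawName)); // Some extraneous text liked this
```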

notenoughneon commented 9 years ago

Thanks for clarifying. It sounds like the implied property rule breaks the "note type algorithm" and recommended practice of including a p-name in notes, if the note happens to be a photo with no text. Should this be documented on http://microformats.org/wiki/microformats2-parsing-issues?

glennjones commented 9 years ago

I have added this problem to http://microformats.org/wiki/microformats2-parsing-issues. Sorry it's taken a little while, but I wanted to go through my code carefully to make sure it was not an issue with my parser.

Please feel free to add your own view on how this should be dealt with to the wiki page.

kylewm commented 9 years ago

Interesting question! mf2py and php-mf2 will both happily include empty string values in the output and not generate an implied name. Added comments/votes to the wiki.
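The alternative described here can be sketched as follows: an explicit p-name is kept even when its text is empty, and a name is implied only when no p-name element exists at all. `resolveNameKeepingEmpty` is a hypothetical name for illustration and is not taken from mf2py or php-mf2.

```javascript
// Hypothetical helper, not code from mf2py or php-mf2.
// explicitValues: text of each p-name element found inside the h-* element.
// parentText: full text content of the parent h-* element.
function resolveNameKeepingEmpty(explicitValues, parentText) {
  // The presence of a p-name element is what matters, not whether it has text.
  if (explicitValues.length > 0) {
    return explicitValues;
  }
  // Only imply a name when the author supplied no p-name at all.
  return [parentText.trim()];
}

// A p-name that wraps only an image yields an empty string, and it is kept.
console.log(resolveNameKeepingEmpty([''], 'Some extraneous text liked this'));

// With no p-name element, the implied rule still applies.
console.log(resolveNameKeepingEmpty([], ' Some extraneous text liked this '));
```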

glennjones commented 9 years ago

Hi Emma, it's been a while, but your view of what the output should be was agreed and I have now updated all my JavaScript parser code. You can try the HTML from your example at http://glennjones.net/tools/microformats and the p-name should now be returned as an empty string.