kartikprabhu commented 6 years ago

cc: @tantek @aaronpk @kevinmarks @gregorlove

The parsing spec is currently silent on how to handle <noscript> tags. Parsers seem to handle this in different ways.

Particular issues

1. Should <noscript> tags be included while explicitly parsing p-* using textContent?
2. Should <noscript> tags be included in e-*[html]?
3. Should <noscript> tags be included in e-*[value]?
4. Should the content of <noscript> tags be included while parsing explicit p-* on the <noscript> tag?
5. Should properties nested inside a <noscript> tag be parsed?

All parsers tested below answer 2 with "yes".
Only phpmf2 answers 1, 3, and 4 with "no".
Only Go answers 5 with "no".

HTML

<article class="h-entry">
<span class="p-name">Some name
<noscript>noscript in name</noscript>
</span>
<div class="e-content">
<span>This is some content</span>
<noscript>noscript in content</noscript>
</div>
<noscript class="p-summary">This is summary inside noscript</noscript>
<noscript><img class="u-photo" src="http://example.com"/></noscript>
</article>

Current Parser outputs

Ruby and Go were tested via http://microformats.io

mf2py 1.1.0 (using default html5lib parser), Ruby parser 4.0.6

"items": [
        {
            "type": [
                "h-entry"
            ], 
            "properties": {
                "content": [
                    {
                        "html": "<span>This is some content</span>\n<noscript>noscript in content</noscript>", 
                        "value": "This is some content\nnoscript in content"
                    }
                ], 
                "photo": [
                    "http://example.com"
                ], 
                "name": [
                    "Some name\nnoscript in name"
                ], 
                "summary": [
                    "This is summary inside noscript"
                ]
            }
        }
    ]

phpmf2

"items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Some name"
                ],
                "summary": [
                    ""
                ],
                "photo": [
                    "http://example.com"
                ],
                "content": [
                    {
                        "html": "<span>This is some content</span>\r\n<noscript>noscript in content</noscript>",
                        "value": "This is some content"
                    }
                ]
            }
        }
    ]

Go

"items": [
    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "content": [
          {
            "html": "\u003cspan\u003eThis is some content\u003c/span\u003e\n\u003cnoscript\u003enoscript in content\u003c/noscript\u003e",
            "value": "\nThis is some content\nnoscript in content\n"
          }
        ],
        "name": [
          "Some name\nnoscript in name"
        ],
        "summary": [
          "This is summary inside noscript"
        ]
      }
    }
  ]
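As a rough cross-check of the summary under "Particular issues", the answers can be derived from the outputs above. This is only a sketch: the property values are copied from the three outputs (with JSON escapes decoded), while `OUTPUTS`, `noscript_answers` and `ANSWERS` are names made up for illustration.

```python
# Property values copied from the parser outputs above (JSON escapes decoded).
OUTPUTS = {
    "mf2py / Ruby": {
        "name": ["Some name\nnoscript in name"],
        "content": [{
            "html": "<span>This is some content</span>\n<noscript>noscript in content</noscript>",
            "value": "This is some content\nnoscript in content",
        }],
        "summary": ["This is summary inside noscript"],
        "photo": ["http://example.com"],
    },
    "phpmf2": {
        "name": ["Some name"],
        "content": [{
            "html": "<span>This is some content</span>\r\n<noscript>noscript in content</noscript>",
            "value": "This is some content",
        }],
        "summary": [""],
        "photo": ["http://example.com"],
    },
    "Go": {
        "name": ["Some name\nnoscript in name"],
        "content": [{
            "html": "<span>This is some content</span>\n<noscript>noscript in content</noscript>",
            "value": "\nThis is some content\nnoscript in content\n",
        }],
        "summary": ["This is summary inside noscript"],
        # no "photo" property in the Go output
    },
}


def noscript_answers(props):
    """Map question number -> True ("yes") or False ("no") for one parser."""
    content = props["content"][0]
    return {
        1: "noscript in name" in props["name"][0],
        2: "<noscript>" in content["html"],
        3: "noscript in content" in content["value"],
        4: props["summary"][0] != "",
        5: bool(props.get("photo")),
    }


ANSWERS = {parser: noscript_answers(props) for parser, props in OUTPUTS.items()}
for parser, answers in ANSWERS.items():
    print(parser, answers)
```

The checks are plain substring tests against this one test document, so they only classify these outputs and say nothing about how each parser behaves in general.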

sknebel commented 6 years ago

In-the-wild example of 5), properties nested below a <noscript> tag: @snarfed has an author h-card on his entries like

<span class="author p-author h-card vcard"> 
<img alt src="https://snarfed.org/…/1x1.trans.gif" class=""  data-lazy-src="https://secure.gravatar.com/…">

<noscript>
<img alt='' src='https://secure.gravatar.com/…' srcset='https://secure.gravatar.com… 2x' class='… u-photo'/>
</noscript>

<a class="u-url url fn n p-name" href="https://snarfed.org/" title="Ryan Barrett" rel="author">
Ryan Barrett</a></span>

Due to the image-lazyloading code, the photo URL isn't in the non-noscript markup as a src attribute, and thus u-photo is only marked up inside the <noscript>. I don't see a reason why parsers should not support this case and look inside the tag.
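To make the difference concrete, here is a minimal sketch using only Python's standard html.parser. It is not how any of the mf2 parsers are implemented. The markup is a simplified stand-in for the h-card above with placeholder example.com URLs (assumptions, since the real URLs are elided), and `PhotoCollector` and `collect_photos` are hypothetical names.

```python
from html.parser import HTMLParser

# Simplified lazy-loading h-card; all URLs are placeholders, not the real ones.
LAZY_LOADED_HCARD = """
<span class="p-author h-card">
<img alt src="https://example.com/1x1.trans.gif" data-lazy-src="https://example.com/avatar.jpg">
<noscript>
<img alt='' src='https://example.com/avatar.jpg' class='avatar u-photo'/>
</noscript>
<a class="u-url p-name" href="https://example.com/">Example Author</a>
</span>
"""


class PhotoCollector(HTMLParser):
    """Collect src values of img.u-photo, optionally skipping <noscript>."""

    def __init__(self, descend_into_noscript):
        super().__init__()
        self.descend_into_noscript = descend_into_noscript
        self.noscript_depth = 0  # > 0 while inside a <noscript> element
        self.photos = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
            return
        if self.noscript_depth and not self.descend_into_noscript:
            return  # emulate a parser that ignores <noscript> children
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if tag == "img" and "u-photo" in classes and attr.get("src"):
            self.photos.append(attr["src"])

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth:
            self.noscript_depth -= 1


def collect_photos(html, descend_into_noscript):
    parser = PhotoCollector(descend_into_noscript)
    parser.feed(html)
    parser.close()
    return parser.photos


print(collect_photos(LAZY_LOADED_HCARD, descend_into_noscript=True))
print(collect_photos(LAZY_LOADED_HCARD, descend_into_noscript=False))
```

With descent enabled the photo URL is found; with it disabled the h-card ends up with no explicit u-photo at all, which matches the Go result for question 5.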

gRegorLove commented 6 years ago

Node output

from https://glennjones.net/tools/microformats/

{
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "name": ["Some name\nnoscript in name"],
            "content": [{
                "value": "This is some content\nnoscript in content",
                "html": "\n<span>This is some content</span>\n<noscript>noscript in content</noscript>\n"
            }],
            "summary": ["This is summary inside noscript"],
            "photo": ["http://example.com"]
        }
    }],
    "rels": {},
    "rel-urls": {}
}

snarfed commented 6 years ago

thank you all for the in-depth sleuthing! i noticed and dealt with this recently myself in https://github.com/snarfed/bridgy/issues/798 .

worth noting: noscript tag handling evidently depends on the underlying HTML parser, not mf2py itself. lxml returns noscript contents, html5lib ignores them.

kartikprabhu commented 6 years ago

@snarfed I don't think html5lib ignores them. The output for mf2py above was using html5lib and it does parse the contents of <noscript>. Maybe you meant the other way around, i.e. lxml ignores the <noscript>?

snarfed commented 6 years ago

@kartikprabhu html5lib definitely ignored noscript in my testing a week ago. latest released version of mf2py afaik, Python 2.7. details in https://github.com/snarfed/bridgy/issues/798#issuecomment-370508015 . maybe differences in our environments, or because there was an earlier img without u-photo, so it used an implied rule first? who knows.

kartikprabhu commented 6 years ago

@snarfed aah! might be since I am using html5lib v 1.0.1 with the newly updated mf2py parser from my repo

sknebel commented 6 years ago

This was indeed changed in html5lib 0.99999999, which mf2py just jumped past, so this behaving differently now is expected.

Zegnat commented 6 years ago

Should <noscript> tags be included while explicitly parsing p-* using textContent?

These questions are interesting because noscript, like template, is a little odd. If you assume scripting to be enabled, the textContent of the noscript in <noscript><span>hi!</span></noscript> is not hi! but <span>hi!</span>, which is probably not what the HTML author is expecting from the mf2 parser.

If we assume all mf2 parsers are operating where scripting is disabled, the noscript element basically acts the same in the DOM tree as an a element and I see no reason to ignore it. This makes sense to me, unless we can show this will lead to a lot of duplicated content in the wild.
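A small sketch of the two readings, using only Python's standard html.parser. This emulates the DOM behaviour described above and is not taken from any mf2 parser; `NoscriptTextContent`, `text_content` and the `emulate_scripting` flag are names invented for illustration.

```python
from html.parser import HTMLParser


class NoscriptTextContent(HTMLParser):
    """Collect text; optionally emulate a scripting-enabled DOM for <noscript>."""

    def __init__(self, emulate_scripting):
        super().__init__(convert_charrefs=True)
        self.emulate_scripting = emulate_scripting
        self.noscript_depth = 0  # > 0 while inside a <noscript> element
        self.parts = []

    def _keep_raw(self):
        # With scripting enabled, <noscript> children are text, not elements.
        return self.emulate_scripting and self.noscript_depth > 0

    def handle_starttag(self, tag, attrs):
        if self._keep_raw():
            self.parts.append(self.get_starttag_text())
        if tag == "noscript":
            self.noscript_depth += 1

    def handle_startendtag(self, tag, attrs):
        if self._keep_raw():
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth:
            self.noscript_depth -= 1
        if self._keep_raw():
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)


def text_content(html, emulate_scripting):
    parser = NoscriptTextContent(emulate_scripting)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


SAMPLE = "<noscript><span>hi!</span></noscript>"
print(repr(text_content(SAMPLE, emulate_scripting=True)))
print(repr(text_content(SAMPLE, emulate_scripting=False)))
```

The scripting-disabled call yields just the text, the scripting-enabled one yields the markup as a string, which is the result an author would probably not expect in a p-* value.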

Should <noscript> tags be included in e-*[html]?

Yes. There is nothing in the HTML fragment serialization algorithm linked to by the spec that excludes them.

Should <noscript> tags be included in e-*[value]?

I believe this should follow the same plaintext parsing as p-*. So this should match whatever is decided for question 1.

Should the content of <noscript> tags be included while parsing explicit p-* on the <noscript> tag?

I would say yes, but the contents of this textContent depend on what is decided in question 1. Again, I would lean towards the scripting-disabled case, which handles the noscript element (almost) no differently from other elements.

Should properties nested inside a <noscript> tag be parsed?

microformats / microformats2-parsing

Parsing <noscript> tags #24
