microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

Parsing <noscript> tags #24

Open kartikprabhu opened 6 years ago

kartikprabhu commented 6 years ago

cc: @tantek @aaronpk @kevinmarks @gregorlove

The parsing spec is currently silent on how to handle <noscript> tags. Parsers seem to handle this in different ways.

Particular issues

  1. Should <noscript> tags be included while explicitly parsing p-* using textContent?
  2. Should <noscript> tags be included in e-*[html]?
  3. Should <noscript> tags be included in e-*[value]?
  4. Should the content of <noscript> tags be included while parsing explicit p-* on the <noscript> tag?
  5. Should properties nested inside a <noscript> tag be parsed?

HTML

<article class="h-entry">
<span class="p-name">Some name
<noscript>noscript in name</noscript>
</span>
<div class="e-content">
<span>This is some content</span>
<noscript>noscript in content</noscript>
</div>
<noscript class="p-summary">This is summary inside noscript</noscript>
<noscript><img class="u-photo" src="http://example.com"/></noscript>
</article>

Current Parser outputs

Ruby, Go testing from http://microformats.io

mf2py 1.1.0 (using default html5lib parser) and Ruby parser 4.0.6:

"items": [
        {
            "type": [
                "h-entry"
            ], 
            "properties": {
                "content": [
                    {
                        "html": "<span>This is some content</span>\n<noscript>noscript in content</noscript>", 
                        "value": "This is some content\nnoscript in content"
                    }
                ], 
                "photo": [
                    "http://example.com"
                ], 
                "name": [
                    "Some name\nnoscript in name"
                ], 
                "summary": [
                    "This is summary inside noscript"
                ]
            }
        }
    ]

phpmf2

"items": [
        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Some name"
                ],
                "summary": [
                    ""
                ],
                "photo": [
                    "http://example.com"
                ],
                "content": [
                    {
                        "html": "<span>This is some content</span>\r\n<noscript>noscript in content</noscript>",
                        "value": "This is some content"
                    }
                ]
            }
        }
    ]

Go

"items": [
    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "content": [
          {
            "html": "\u003cspan\u003eThis is some content\u003c/span\u003e\n\u003cnoscript\u003enoscript in content\u003c/noscript\u003e",
            "value": "\nThis is some content\nnoscript in content\n"
          }
        ],
        "name": [
          "Some name\nnoscript in name"
        ],
        "summary": [
          "This is summary inside noscript"
        ]
      }
    }
  ]

sknebel commented 6 years ago

In-the-wild example of 5), properties nested below a <noscript> tag: @snarfed has an author h-card on his entries like

<span class="author p-author h-card vcard"> 
<img alt src="https://snarfed.org/…/1x1.trans.gif" class=""  data-lazy-src="https://secure.gravatar.com/…">

<noscript>
<img alt='' src='https://secure.gravatar.com/…' srcset='https://secure.gravatar.com… 2x' class='… u-photo'/>
</noscript>

<a class="u-url url fn n p-name" href="https://snarfed.org/" title="Ryan Barrett" rel="author">
Ryan Barrett</a></span>

Due to the image lazy-loading code, the photo URL isn't in the src attribute of the non-noscript markup, and thus u-photo is only marked up inside the <noscript>. I don't see a reason why parsers should not support this case and look inside the tag.
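
To check whether a page has this pattern, a rough sketch along these lines could list property class names and flag the ones that only appear inside a <noscript>. It uses Python's standard html.parser, which by default parses <noscript> contents as ordinary markup. It is not a microformats2 parser; find_properties and PropertyFinder are hypothetical names, and the prefix list is only a simplification for illustration.

```python
from html.parser import HTMLParser

# Property prefixes to look for (simplified; root h-* classes are skipped).
PREFIXES = ("p-", "u-", "e-", "dt-")


class PropertyFinder(HTMLParser):
    """Record property class names and whether they sit inside a <noscript>."""

    def __init__(self):
        super().__init__()
        self.noscript_depth = 0
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name.startswith(PREFIXES):
                # Classes on the <noscript> tag itself are not counted as "inside".
                self.found.append((name, self.noscript_depth > 0))
        if tag == "noscript":
            self.noscript_depth += 1

    def handle_endtag(self, tag):
        if tag == "noscript" and self.noscript_depth:
            self.noscript_depth -= 1


def find_properties(html):
    """Return a list of (class_name, inside_noscript) pairs in document order."""
    finder = PropertyFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Run against the test HTML from the first comment, this would report u-photo as the only property nested inside a <noscript>, which is the case a parser has to look inside the tag to find.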

gRegorLove commented 6 years ago

Node output

from https://glennjones.net/tools/microformats/

{
    "items": [{
        "type": ["h-entry"],
        "properties": {
            "name": ["Some name\nnoscript in name"],
            "content": [{
                "value": "This is some content\nnoscript in content",
                "html": "\n<span>This is some content</span>\n<noscript>noscript in content</noscript>\n"
            }],
            "summary": ["This is summary inside noscript"],
            "photo": ["http://example.com"]
        }
    }],
    "rels": {},
    "rel-urls": {}
}

snarfed commented 6 years ago

thank you all for the in depth sleuthing! i noticed and dealt with this recently myself in https://github.com/snarfed/bridgy/issues/798 .

worth noting: noscript tag handling evidently depends on the underlying HTML parser, not mf2py itself. lxml returns noscript contents, html5lib ignores them.

kartikprabhu commented 6 years ago

@snarfed I don't think html5lib ignores them. The output for mf2py above was using html5lib and it does parse the contents of <noscript>. Maybe you meant the other way around, i.e. lxml ignores the <noscript>?

snarfed commented 6 years ago

@kartikprabhu html5lib definitely ignored noscript in my testing a week ago. latest released version of mf2py afaik, Python 2.7. details in https://github.com/snarfed/bridgy/issues/798#issuecomment-370508015 . maybe differences in our environments, or because there was an earlier img without u-photo, so it used an implied rule first? who knows.

kartikprabhu commented 6 years ago

@snarfed aah! might be since I am using html5lib v 1.0.1 with the newly updated mf2py parser from my repo

sknebel commented 6 years ago

This was indeed changed in html5lib 0.99999999, which mf2py just jumped past, so this behaving differently now is expected.

Zegnat commented 6 years ago

  1. Should <noscript> tags be included while explicitly parsing p-* using textContent?

These questions are interesting because noscript, like template, is a little odd. If you assume scripting is enabled, the textContent of the noscript in <noscript><span>hi!</span></noscript> is not hi! but <span>hi!</span>, which is probably not what the HTML author expects from the mf2 parser.

If we assume all mf2 parsers operate with scripting disabled, the noscript element basically acts the same in the DOM tree as an <a> element, and I see no reason to ignore it. This makes sense to me, unless we can show this will lead to a lot of duplicated content in the wild.
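
The difference between the two modes can be sketched with Python's standard html.parser. This is only a rough emulation: the scripting-enabled branch re-emits the tags as literal text to mimic how a scripting-enabled HTML parser keeps noscript content as raw text, and it does not reproduce every detail (character references are still decoded, for instance). TextContent, text_content and the scripting_enabled flag are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Collect textContent, optionally emulating a scripting-enabled parser."""

    def __init__(self, scripting_enabled=False):
        super().__init__(convert_charrefs=True)
        self.scripting_enabled = scripting_enabled
        self.in_noscript = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.in_noscript = True
        elif self.in_noscript and self.scripting_enabled:
            # Scripting enabled: markup inside noscript stays literal text.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "noscript":
            self.in_noscript = False
        elif self.in_noscript and self.scripting_enabled:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def text_content(html, scripting_enabled=False):
    """Return the concatenated text of an HTML fragment."""
    parser = TextContent(scripting_enabled)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


fragment = "<noscript><span>hi!</span></noscript>"
print(text_content(fragment))                          # scripting disabled
print(text_content(fragment, scripting_enabled=True))  # scripting enabled
```

With scripting disabled the span is an ordinary child element and only its text is collected; with scripting enabled the markup itself becomes part of the text.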

  2. Should <noscript> tags be included in e-*[html]?

Yes. There is nothing in the HTML fragment serialization algorithm linked to by the spec that excludes them.

  3. Should <noscript> tags be included in e-*[value]?

I believe this should follow the same plaintext parsing as p-*. So this should match whatever is decided for question 4.

  4. Should the content of <noscript> tags be included while parsing explicit p-* on the <noscript> tag?

I would say yes, but the contents of this textContent depend on what is decided in question 1. Again, I would lean towards the scripting-disabled case, which handles the noscript element (almost) no differently from other elements.

  5. Should properties nested inside a <noscript> tag be parsed?

If we can agree that mf2 parsers operate where scripting is disabled (again, see question 1) and we are treating noscript like any other transparent element, then yes.

kartikprabhu commented 5 years ago

At IWS 2018 (https://indieweb.org/2018/microformats#parsing_.2324) it was accepted to treat <noscript> as a <div>.

By this proposal the answer to all five questions posed in the beginning (https://github.com/microformats/microformats2-parsing/issues/24#issue-304172814) should be "yes".

Maybe this should be made explicit in the spec. Here is a proposal to change the section http://microformats.org/wiki/microformats2-parsing#note_HTML_parsing_rules

Add the rule: <noscript> elements are treated as if they are <div> elements.
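
For plain-text values, one way to read this rule is that <noscript> simply does not appear in the set of elements a parser drops. A minimal sketch, assuming text extraction drops only <script> and <style> and using Python's standard html.parser (which parses <noscript> contents as ordinary markup by default); plain_text, PlainText and DROPPED are hypothetical names, and the whitespace handling is just a strip(), not the spec's algorithm:

```python
from html.parser import HTMLParser

# Elements whose contents are dropped from plain-text values.
# noscript is deliberately absent, so it behaves like a <div>.
DROPPED = {"script", "style"}


class PlainText(HTMLParser):
    """Collect text, skipping anything inside the DROPPED elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def plain_text(fragment):
    """Return the stripped text of a fragment, keeping noscript content."""
    parser = PlainText()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts).strip()
```

Applied to the p-name span from the test HTML, this keeps the "noscript in name" text, which corresponds to the "yes" answer to question 1.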