microformats / microformats2-parsing

For collecting and handling issues with the microformats2 parsing specification: http://microformats.org/wiki/microformats2-parsing

consider not including img alt text as part of surrounding text properties #16

Open aaronpk opened 6 years ago

aaronpk commented 6 years ago

This became more critical once I started working on removing images from posts containing a photo in XRay.

Given a post like Tantek's, which starts with two imgs inside the e-content, both with alt text, the parser produces name and content properties that both begin with the img alt text. When XRay removes the img tags from the HTML, it should also remove the imgs' alt text from the plaintext versions, but it can't, since the mf2 parser has already included it.

Aside from that case, I have also heard several anecdotal cases where the alt text doesn't produce a good result for plaintext values.

The proposal is to consider not including img alt text as part of surrounding text properties. It would only be included as part of implied values and when there is an mf2 class on the img tag itself.

aaronpk commented 4 years ago

There is a case where it makes sense to leave the alt text as part of the text content, for example:

<div class="e-content"><img src="fancy-A.png" alt="A">aron</div>

should be parsed as

  "content": {
    "html": "<img src=\"fancy-A.png\" alt=\"A\">aron",
    "value": "Aaron"
  }

so this is a case where the alt text should be left in.

I think the right rule is that if the parser has pulled out the alt text into a property (#2) then it should be removed from the text content. That would allow a consumer to put the post back together if it's consuming the alt text.
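
A minimal sketch of that rule, using only Python's standard-library `html.parser`. It is not taken from any existing mf2 parser: `TextValueParser`, `text_value`, and `PROPERTY_PREFIXES` are hypothetical names, "pulled out into a property" is approximated as "the img has a class starting with a property prefix", and the src fallback and whitespace handling of the real textContent algorithm are left out.

```python
from html.parser import HTMLParser

# Assumption: an img counts as "pulled into a property" when one of its
# classes starts with a microformats2 property prefix.
PROPERTY_PREFIXES = ("p-", "u-", "dt-", "e-")


class TextValueParser(HTMLParser):
    """Collects text, substituting img alt text unless the img is itself a property."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        is_property = any(c.startswith(PROPERTY_PREFIXES) for c in classes)
        alt = attr.get("alt")
        # Keep alt text only for imgs that were not extracted as a property.
        if alt and not is_property:
            self.parts.append(alt)

    def handle_data(self, data):
        self.parts.append(data)


def text_value(html):
    """Return the plaintext value for the inner HTML of an e-* element."""
    parser = TextValueParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()
```

Under these assumptions, `text_value('<img src="fancy-A.png" alt="A">aron')` keeps the alt text and yields "Aaron", while an img carrying `class="u-photo"` contributes nothing to the text value, since its alt is already available on the photo property.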

gRegorLove commented 1 year ago

I'm watching the 2020 microformats session and I think I like the proposed rule. I have examples similar to Tantek's.

Current parse (truncated):

"properties": {
    "photo": [
        {
            "value": "https://gregorlove.com/site/assets/files/6050/pxl_20210410_185435944.1000x0-is.jpg",
            "alt": "selfie wearing a black face mask and red sunglasses with flower decorations while sitting under an orange umbrella"
        }
    ],
    "content": [
        {
            "html": "<figure class=\"no-margin\"><img alt=\"selfie wearing a black face mask and red sunglasses with flower decorations while sitting under an orange umbrella\" class=\"u-photo wide-bleed-image\" src=\"https://gregorlove.com/site/assets/files/6050/pxl_20210410_185435944.1000x0-is.jpg\">\n<figcaption>\n<p>Pandemic chic from April 2021. Sunglasses borrowed from <a class=\"h-card\" href=\"https://www.instagram.com/lizlemonskinnypuppy/\">Laurie</a>.</p>\n</figcaption>\n</figure>",
            "value": "selfie wearing a black face mask and red sunglasses with flower decorations while sitting under an orange umbrella\nPandemic chic from April 2021. Sunglasses borrowed from Laurie.",
            "lang": "en"
        }
    ]
}

So in general, it would make sense to me if the parser removed that matching alt text and the result was:

"properties": {
    "photo": [
        {
            "value": "https://gregorlove.com/site/assets/files/6050/pxl_20210410_185435944.1000x0-is.jpg",
            "alt": "selfie wearing a black face mask and red sunglasses with flower decorations while sitting under an orange umbrella"
        }
    ],
    "content": [
        {
            "html": "<figure class=\"no-margin\"><img alt=\"selfie wearing a black face mask and red sunglasses with flower decorations while sitting under an orange umbrella\" class=\"u-photo wide-bleed-image\" src=\"https://gregorlove.com/site/assets/files/6050/pxl_20210410_185435944.1000x0-is.jpg\">\n<figcaption>\n<p>Pandemic chic from April 2021. Sunglasses borrowed from <a class=\"h-card\" href=\"https://www.instagram.com/lizlemonskinnypuppy/\">Laurie</a>.</p>\n</figcaption>\n</figure>",
            "value": "Pandemic chic from April 2021. Sunglasses borrowed from Laurie.",
            "lang": "en"
        }
    ]
}
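
Until a parser does this, a consumer could approximate the same result after parsing. The following is a rough sketch only, assuming the `properties` structure shown above; `strip_extracted_alt` is a hypothetical helper, not part of XRay or any mf2 library, and its plain string matching would also remove a coincidental match elsewhere in the text.

```python
def strip_extracted_alt(properties):
    """Return copies of the content items with photo alt text removed from 'value'."""
    # Alt strings the parser already exposed on photo properties.
    alts = [
        photo["alt"]
        for photo in properties.get("photo", [])
        if isinstance(photo, dict) and photo.get("alt")
    ]
    cleaned = []
    for item in properties.get("content", []):
        if not isinstance(item, dict):
            cleaned.append(item)
            continue
        value = item.get("value", "")
        for alt in alts:
            # Remove only the first occurrence of each extracted alt text.
            value = value.replace(alt, "", 1)
        cleaned.append({**item, "value": value.strip()})
    return cleaned
```

The input dictionary is left unmodified, and the `html` key is untouched, so the img and its alt attribute remain available to a consumer that wants to put the post back together.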

I don't know the best spec language for this. Here is a really rough attempt at an update for parsing an e- property, with changed lines prefixed with [new] and [adjusted]:

parsing an e- property

  • html ... [no change]
  • value: the textContent of the element after
    • dropping any nested ...