microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License
2.35k stars 168 forks

Enhancement: return image "alt text" #454

Closed jonpincus closed 3 years ago

jonpincus commented 3 years ago

Subject of the issue

For accessibility purposes, I want to be able to show the alt text of images returned by metascraper. It would be great to upgrade the image rules bundle to provide this (since pretty much any situation where you need an image needs alt text) -- or alternatively have a separate rules bundle if that's the only way to do it.

Kikobeats commented 3 years ago

It should be shipped as a separate bundle of rules 🙂

jonpincus commented 3 years ago

Doing it as a separate bundle isn't ideal because the results are related. If the image isn't listed in the header tags but is returned from the body (for example with a rule like $('article img[src]')), then the alt text needs to be taken from the same element. If that element doesn't have alt text, then letting the search progress to other rules would lead to incorrect alt text being returned.
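To illustrate the concern above, here is a minimal sketch of reading the image URL and its alt text from the same element. The helper name `articleImageWithAlt` and the `imageAlt` key are hypothetical, and `$` is assumed to be a cheerio-style document like the `htmlDom` that metascraper passes to rules. It is not how metascraper-image is implemented.

```javascript
// Hypothetical helper, not part of metascraper. `$` is assumed to be a
// cheerio-style document (what metascraper exposes to rules as `htmlDom`).
const articleImageWithAlt = $ => {
  const el = $('article img[src]').first()
  const image = el.attr('src')
  if (!image) return undefined

  // Read alt from the very same element. If it is missing or blank we
  // report null instead of falling through to an unrelated alt source.
  const alt = (el.attr('alt') || '').trim()
  return { image, imageAlt: alt || null }
}
```

Returning both values together keeps the pairing explicit: a missing alt is reported as `null` for that image rather than being filled in from some other element.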

Kikobeats commented 3 years ago

I think that is just an implementation detail.

You can look up meta[property="og:image"] and also meta[property="og:image:alt"], so you can correlate both values.

jonpincus commented 3 years ago

I wrote a simple version that just gets it from og:image:alt and twitter:image:alt fields (if present) ... alas they're only there for about 1/3 of the pages I tried it on.
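For readers following along, a simple version of this kind might look roughly like the sketch below. It is not the code from this thread. It assumes the usual rules-bundle shape (a function returning an object of property names mapped to arrays of rule functions that receive `htmlDom`), and the `imageAlt` property name and the `getContent` helper are made up for illustration.

```javascript
// Sketch of a rules bundle: a function returning { property: [rules] }.
// `imageAlt` is an assumed property name, not an existing metascraper field.
const getContent = selector => ({ htmlDom: $ }) => {
  const value = $(selector).attr('content')
  // Return undefined for missing or blank values so that the next rule
  // in the list is expected to get a chance.
  return typeof value === 'string' && value.trim() ? value.trim() : undefined
}

const imageAltRules = () => ({
  imageAlt: [
    getContent('meta[property="og:image:alt"]'),
    getContent('meta[name="twitter:image:alt"]'),
    getContent('meta[property="twitter:image:alt"]')
  ]
})

// In a real package this would be exported, for example:
// module.exports = imageAltRules
```

Such a bundle would presumably be passed to metascraper alongside the existing ones, for example `require('metascraper')([require('metascraper-image')(), imageAltRules()])`.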

Trying to go further, I'm not sure how to do these correlations: the rules in the existing packages I looked at don't have any examples of selecting an element based on the output of a previous bundle. I want to say something like $('img[src=_][alt]'), where _ is whatever the image bundle returned, but I'm not sure how to go about it. [Although who knows how often the image from the meta fields actually shows up in the article contents.]
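One possible way to approximate this correlation, sketched under assumptions: instead of consuming the image bundle's output, the rule re-reads og:image itself and then looks for a body img whose resolved src matches. The names `resolve` and `altFromMatchingImg` are hypothetical, and the rule is assumed to receive `htmlDom` (cheerio-style) and `url`.

```javascript
// Resolve a possibly relative URL against the page URL; undefined if invalid.
const resolve = (src, base) => {
  try {
    return new URL(src, base).href
  } catch (_) {
    return undefined
  }
}

// Hypothetical rule: re-reads og:image rather than using another bundle's
// output, then searches the body for an <img> pointing at the same URL.
const altFromMatchingImg = ({ htmlDom: $, url }) => {
  const ogImage = $('meta[property="og:image"]').attr('content')
  const target = ogImage && resolve(ogImage, url)
  if (!target) return undefined

  const match = $('img[src][alt]')
    .toArray()
    .find(el => resolve($(el).attr('src'), url) === target)

  // Only a non-blank alt on the matching element counts as a result.
  const alt = match && ($(match).attr('alt') || '').trim()
  return alt || undefined
}
```

Comparing resolved URLs rather than raw attribute strings avoids missing a match when the meta tag is absolute and the img src is relative. It still will not match when the page serves a differently sized or CDN-rewritten copy of the image in the body.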

Kikobeats commented 3 years ago

If you want to check for non-empty values, I think it should be something like this:

https://codepen.io/starikovs/pen/tngqH

But what I recommend is to start with something more basic; we can improve the rules bundle over time.

Just creating a rules bundle that groups the alt versions of these rules sounds like a good start to me:

https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-image/index.js#L10

My feeling is that, in the end, you can't correlate image/image-alt values at all; although the two are related, HTML markup is a jungle. Even an alt without an image is semantically valid, and the low presence of the selector tells me that maybe we can't be as strict as we want there.

Also, I recommend you start with just a plain list of alt selectors, because we can test them against our integration tests:

https://github.com/microlinkhq/metascraper/tree/master/packages/metascraper/test/integration

so we can have a global view of how these selectors are used.

Kikobeats commented 3 years ago

I don't see a clear direction here, so I'm closing this for now; it will be revisited in the future.