mozilla / readability

A standalone version of the readability lib

feat: add support for parsely published date, title, and author #865

Closed inhumantsar closed 4 months ago

inhumantsar commented 5 months ago

Adds Parsely tags as a fallback option for metadata. Parsely is a content analytics service aimed at larger publishers running WordPress, e.g. The Verge.

It's worth noting that Parsely tags are unlikely to exist in isolation and seem to be populated alongside Open Graph (og) tags and JSON-LD data in nearly all cases. That said, I would like to add other tag sets that will be more valuable, and this was a nice, simple one to familiarize myself with the codebase. I will totally understand if the preference is to keep the regex patterns from growing too large by leaving out less common sources of metadata like this.
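To make the fallback idea concrete, here is a minimal sketch of how `parsely-*` tags could sit behind more common sources. It is not the code from this PR or from Readability itself: `pickMetadata` and the `{ name, content }` input shape are hypothetical, and the exact priority order is an assumption. The tag names `parsely-title`, `parsely-author`, and `parsely-pub-date` are the ones Parsely documents for its meta tags.

```javascript
// Hypothetical helper, not Readability's actual implementation.
// `metas` is a plain array of { name, content } pairs standing in for the
// <meta> elements a parser would collect (name or property attribute).
function pickMetadata(metas) {
  const values = new Map();
  for (const { name, content } of metas) {
    if (name && content) {
      values.set(name.toLowerCase(), content.trim());
    }
  }

  // Earlier keys win; the parsely-* names come last, as a fallback only.
  const firstOf = (...keys) => {
    for (const key of keys) {
      if (values.has(key)) return values.get(key);
    }
    return null;
  };

  return {
    title: firstOf("og:title", "parsely-title"),
    byline: firstOf("author", "parsely-author"),
    publishedTime: firstOf("article:published_time", "parsely-pub-date"),
  };
}

const parselyExample = pickMetadata([
  { name: "parsely-title", content: "Example headline" },
  { name: "parsely-author", content: "Jane Doe" },
  { name: "parsely-pub-date", content: "2024-01-15T12:00:00Z" },
  { name: "og:title", content: "Example headline (og)" },
]);
console.log(parselyExample);
```

In this sketch the Open Graph title takes precedence where both exist, while the author and date come from the Parsely tags because nothing else supplies them.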

fchasen commented 4 months ago

Thanks for these, but as you mentioned, I'm a bit torn on whether this makes sense to add.

On the one hand, these tags seem widely enough used to include, but on the other they do seem to repeat info we already have. Is this capturing different metadata we wouldn't get from the JSON-LD already, or is it just in case a site includes these tags but not the JSON-LD?

Looking through the JSON-LD description in https://docs.parse.ly/metadata-jsonld/, they have a few types included as a "post" type that we don't look for, so it might be worth adding those to jsonLdArticleTypes at the very least.
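As a rough illustration of what widening the accepted types would mean, the sketch below checks a JSON-LD node's `@type` against a set of article-like schema.org types. The set is illustrative only: it is neither the actual contents of `jsonLdArticleTypes` nor a verified copy of the list in the Parsely docs, and `isArticleLike` is a hypothetical name.

```javascript
// Illustrative schema.org types only. This is an assumption, not the real
// jsonLdArticleTypes pattern and not a verified copy of Parsely's list.
const ARTICLE_LIKE_TYPES = new Set([
  "Article",
  "NewsArticle",
  "BlogPosting",
  "TechArticle",
  "Report",
  "LiveBlogPosting",
]);

// JSON-LD allows @type to be a single string or an array of strings.
function isArticleLike(node) {
  const type = node ? node["@type"] : undefined;
  const types = Array.isArray(type) ? type : [type];
  return types.some((t) => typeof t === "string" && ARTICLE_LIKE_TYPES.has(t));
}

console.log(isArticleLike({ "@type": "LiveBlogPosting" }));
console.log(isArticleLike({ "@type": ["WebPage", "Report"] }));
console.log(isArticleLike({ "@type": "WebSite" }));
```

A set lookup is used here for readability; a single anchored regular expression over the type name would express the same check.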

inhumantsar commented 4 months ago

I'd say it's mainly for sites that don't include JSON-LD. I've run into a few others like these too, e.g. I have an open issue right now for dc:* and prism:*. They seem to be used on academic-adjacent sites, e.g. Nature and Our World in Data. Neither of those sites uses JSON-LD.

Adding these will be a bit repetitive in the codebase, but it would make metadata capture much more consistent.
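For context, a sketch of what reading `dc:*` and `prism:*` tags might look like is below. The function names are hypothetical, and the specific keys (`dc:title`, `dc:creator`, `dc:date`, `prism:publicationDate`) are assumptions based on common Dublin Core and PRISM usage rather than anything confirmed in this thread. Sites write these names with either a dot or a colon, so the sketch normalizes the separator first.

```javascript
// Hypothetical helpers; the key names are assumptions, not taken from the PR.
// Lowercase the name and treat "dc.title" and "dc:title" as the same key.
function normalizeMetaName(name) {
  return name.trim().toLowerCase().replace(/\./g, ":");
}

// `metas` is a plain array of { name, content } pairs standing in for
// the page's <meta> elements.
function collectScholarlyMetadata(metas) {
  const values = new Map();
  for (const { name, content } of metas) {
    if (name && content) {
      values.set(normalizeMetaName(name), content.trim());
    }
  }
  const get = (key) => (values.has(key) ? values.get(key) : null);

  return {
    title: get("dc:title"),
    byline: get("dc:creator"),
    // Prefer the PRISM date, then fall back to the Dublin Core one.
    publishedTime: get("prism:publicationdate") || get("dc:date"),
  };
}

const scholarlyExample = collectScholarlyMetadata([
  { name: "DC.title", content: "A paper" },
  { name: "dc:creator", content: "A. Author" },
  { name: "prism.publicationDate", content: "2023-06-01" },
]);
console.log(scholarlyExample);
```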

I'll include the JSON-LD equivalents for anything new as well.