microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License
2.35k stars 168 forks source link

Get a description from the first paragraph #471

Closed mikkmartin closed 3 years ago

mikkmartin commented 3 years ago

If description is missing use paragraph, use <p> A naive implementation of how Facebook does it.

Kikobeats commented 3 years ago

Thanks for the PR! Please take a look at the tests, there are some linter issues.

My concern is this is going to introduce inaccurate values that will make it not worth it to introduce this rule.

Most sites don't care about HTML markup, so you are expecting text content over p, but probably they are using span or div as well, adding more imprecision into the rule.

mikkmartin commented 3 years ago

Thanks for the reply.

True. This can cause inaccurate results, depending on if you value spec "accurate" description: null over description: best guess.

Have you encountered this use-case of something over nothing before? Any alternatives/packages to look at?

Anyway thanks in advance

Kikobeats commented 3 years ago

If you want to have enough good description fallback, probably the best solution right now is metascraper-readability.

You can combine both packages to create a pipeline based on priority:

const metascraper = require('metascraper')([
  require('metascraper-description')(),
  require('metascraper-readability')()
])

So if the value is not resolved by metascraper-description, then metascraper-readability will give be a try 🙂

mikkmartin commented 3 years ago

Ok, perfect. Thanks for the help! :)