microlinkhq / metascraper

Get unified metadata from websites using Open Graph, Microdata, RDFa, Twitter Cards, JSON-LD, HTML, and more.
https://metascraper.js.org
MIT License
2.35k stars, 168 forks

Link with metadata not working e.g. https://www.alltrails.com/lists/kate-agnew-ny-trails #414

Closed: alexpluto closed this issue 3 years ago

alexpluto commented 3 years ago

Subject of the issue

When passing this link in: https://www.alltrails.com/lists/kate-agnew-ny-trails

An error is returned, even though the page's HTML contains metadata.

Error: HTTPError: Response code 403 (Forbidden)

Steps to reproduce

Get metadata for: https://www.alltrails.com/lists/kate-agnew-ny-trails

Expected behaviour

Metadata is returned

Actual behaviour

An error (HTTPError: Response code 403 (Forbidden)) is returned for: https://www.alltrails.com/lists/kate-agnew-ny-trails

alexpluto commented 3 years ago

Here is another link that is failing: https://snowflakegelato.co.uk/

This page also has a title, hero image, etc., yet it fails too.

Kikobeats commented 3 years ago

That's actually not a metascraper issue.

For example, check the rules under metascraper-description. These rules are applied, in order, to the HTML markup of the target URL, and the first rule that yields a valid value wins.

If the markup of the target URL doesn't match any of these rules, then metascraper has no value to extract.
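To illustrate the "first rule with a valid value" idea, here is a minimal, dependency-free sketch. The rule list, the `matchMeta` helper, and the `applyRules` function are hypothetical names made up for this example; they are not metascraper's real rules, which work on a parsed DOM rather than regular expressions. The sketch also assumes the identifying attribute appears before `content` in each tag.

```javascript
// Hypothetical sketch of "first rule with a valid value wins".
// These regex-based rules are illustrative only, not metascraper's actual rules.

// Return the content of the first <meta> tag whose `attr` equals `value`.
// Assumption: the identifying attribute comes before `content` in the tag.
function matchMeta (html, attr, value) {
  const pattern = new RegExp(
    `<meta[^>]*${attr}=["']${value}["'][^>]*content=(["'])(.*?)\\1`,
    'i'
  )
  const match = html.match(pattern)
  return match ? match[2].trim() : undefined
}

// Rules are ordered from most to least preferred source.
const descriptionRules = [
  (html) => matchMeta(html, 'property', 'og:description'),
  (html) => matchMeta(html, 'name', 'twitter:description'),
  (html) => matchMeta(html, 'name', 'description')
]

// Try each rule in order and keep the first non-empty value.
function applyRules (rules, html) {
  for (const rule of rules) {
    const value = rule(html)
    if (value) return value
  }
  return undefined // no rule matched, so there is nothing to extract
}
```

With this sketch, a page that only has a plain `description` meta tag still yields a value through the last rule, while a page with none of the three tags yields `undefined`.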

These target URLs have very poor HTML markup for sharing purposes.

alexpluto commented 3 years ago

@Kikobeats thanks for the quick reply! Really appreciate it.

Strangely, with the AllTrails link (https://www.alltrails.com/lists/kate-agnew-ny-trails), I can see og:title and og:image, but og:description is missing.

Shouldn't these fields be returned, even if description fails?

Is the "HTTPError: Response code 403 (Forbidden)" error caused by the missing description?

Kikobeats commented 3 years ago

It's a network issue caused by the anti-bot protection on the target URL.

To be able to fetch the content there, I recommend you set up your own proxy service with Microlink: https://microlink.io/docs/api/parameters/proxy
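The key point is that the 403 happens while fetching the page, before any metadata rule runs, so it says nothing about the markup itself. As a rough sketch, a caller could tell the two situations apart like this. `classifyFailure` is a hypothetical helper, and it assumes the HTTP error exposes the status code on `error.response.statusCode`; adjust it to whatever your HTTP client actually provides.

```javascript
// Hypothetical helper: decide whether a failure happened at the fetch stage
// (the target refused the request) rather than at the extraction stage.
// Assumption: HTTP errors carry the status on `error.response.statusCode`.
function classifyFailure (error) {
  const status = error && error.response && error.response.statusCode
  // 403 and 429 are typical responses from anti-bot or rate-limiting layers.
  if (status === 403 || status === 429) return 'blocked-by-target'
  // Any other HTTP status still means the HTML was never retrieved.
  if (typeof status === 'number') return 'http-error'
  // No HTTP status available: the cause is something else entirely.
  return 'unknown'
}
```

A result of `'blocked-by-target'` is the case where retrying the request through a proxy is worth trying; a missing description, by contrast, never produces an HTTP error.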

alexpluto commented 3 years ago

@Kikobeats thank you! I'll look into this, we do use a proxy for other requests.

Kikobeats commented 3 years ago

Hey,

I wrote a blog post explaining what's happening there:

https://microlink.io/blog/proxy-capabilities/

alexpluto commented 3 years ago

Great post, thank you!