oscarotero / Embed

Get info from any web service or page
MIT License
2.09k stars 312 forks source link

Canonical URL seems to be wrong #478

Closed grandeljay closed 2 years ago

grandeljay commented 2 years ago

Please excuse me if I am using this wrong but the canonical URL seems to be wrong here:

https://oscarotero.com/embed/demo/index.php?url=https%3A%2F%2Fwww.etsy.com%2Fde%2Flisting%2F653731908%2Fvalentinstag-geschenk-fur-ihn-holzuhr%3Fclick_key%3D49&settings=

It's showing the URL I used instead of the canonical URL from the source or even an empty value, which I would prefer. Is it maybe a feature and defaulting to my URL instead of an empty value?

image image image

oscarotero commented 2 years ago

Hi. Canonical URL is not always a reliable value. I had too many issues from sites that returns the same canonical url for all pages, or goes to a different page. This is why it's not used.

grandeljay commented 2 years ago

I had too many issues from sites that returns the same canonical url for all pages, or goes to a different page.

Wouldn't that be the site owners problem? If they mark a URL as canonical shouldn't that be used? Even if it seems wrong to us?

oscarotero commented 2 years ago

Yes, if all developers made everything right and standard, it would be more easy but sadly it's not happening :) Even Google can ignore the canonical value on index a website for various reasons.

To include the canonical url or any other desired value, you can extend the Url detector.

grandeljay commented 2 years ago

Oh, neat. Thanks!

grandeljay commented 2 years ago

Is there an easy way to just return null or something empty if the canonical URL is not set? I'm not understanding how this works :/

public function detect(): UriInterface
{
    $oembed = $this->extractor->getOEmbed();

    return $oembed->url('url') ?? new Url('');
}
oscarotero commented 2 years ago

The url cannot be null. What you can do is check the canonical url:

public function detect(): UriInterface
{
    $oembed = $this->extractor->getOEmbed();

    return $oembed->url('url')
        ?: $oembed->url('web_page')
        ?: $this->getCanonical()
        ?: $this->extractor->getUri();
}

protected function getCanonical(): ?UriInterface
{
    $document = $this->extractor->getDocument();
    foreach ($document->select('.//link[@canonical]')->nodes() as $node) {
        $href = $node->getAttribute('href');

        if ($href) {
            return $this->extractor->resolveUri($href);
        }
    }
}

(Not tested)

grandeljay commented 2 years ago

This is perfect, thank you so much!

Just in case anybody else is interested, here is my slightly revised version (with a return null statement) I am using:

public function detect(): UriInterface
{
    $oembed = $this->extractor->getOEmbed();

    return $oembed->url('url')
        ?: $oembed->url('web_page')
        ?: $this->getCanonical()
        ?: $this->extractor->getUri();
}

protected function getCanonical(): ?UriInterface
{
    $document = $this->extractor->getDocument();

    foreach ($document->select('.//link[@canonical]')->nodes() as $node) {
        $href = $node->getAttribute('href');

        if ($href) {
            return $this->extractor->resolveUri($href);
        }
    }

    return null;
}

The full file is here

grandeljay commented 2 years ago

@oscarotero if you're willing to consider it, I'd open a separate issue for it, but I would love to have an indicator/property, etc to determine if the sites canonical tag was found/used or not.

Is that possible?

oscarotero commented 2 years ago

You can create a different detector instead of extending the current one. This allows to have a custom property, like $info->canonical.

See this: https://github.com/oscarotero/Embed#detectors