nystudio107 / craft-seomatic

SEOmatic facilitates modern SEO best practices & implementation for Craft CMS 3. It is a turnkey SEO system that is comprehensive, powerful, and flexible.
https://nystudio107.com/plugins/seomatic

multisite and multiple robots.txt #859

Closed hiasl closed 3 years ago

hiasl commented 3 years ago

Question

I'm using Craft 3.4.30 with Seomatic 3.3.35

Our setup is multisite, 2 site groups (2 different websites with 2 domains), 6 languages in each site group. NOT headless. The primary language (en) is served from the root of each domain https://domain/ and https://domain2/, the other 5 languages from a URL segment/directory with the ISO code of each language, e.g.

The problem: SEOmatic creates separate robots.txt for each multisite/language, although they share the same domain. So I get a

From my point of view, SEOmatic's behavior is wrong: there should be only one robots.txt per domain, linking to ALL sitemaps in all languages. I do not think search engines will look for a robots.txt in each language directory (/de/robots.txt, /fr/robots.txt, ...). If other languages were served from, e.g., subdomains, each subdomain should have its own robots.txt.

Please let me know what you think and if you agree, please try to find a solution. My suggestion is to make sure that there is only one robots.txt per domain, which should contain all relevant links for all multisites within that domain.
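The suggestion above can be sketched as a grouping step: collect every site's base URL, group by host, and emit one list of sitemap lines per host. This is only an illustration of the idea, not SEOmatic's implementation; the function name and the `sitemap.xml` file name are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def robots_sitemap_lines(site_base_urls):
    """Group site base URLs by host and build one set of Sitemap lines per host."""
    by_host = defaultdict(list)
    for base_url in site_base_urls:
        host = urlsplit(base_url).netloc
        # Hypothetical sitemap file name; the real name depends on the plugin.
        by_host[host].append(base_url.rstrip("/") + "/sitemap.xml")
    return {
        host: [f"Sitemap: {url}" for url in urls]
        for host, urls in by_host.items()
    }


# Sites that share a host end up in the same robots.txt.
sites = ["https://domain/", "https://domain/de/", "https://domain/fr/", "https://domain2/"]
lines = robots_sitemap_lines(sites)
```

With this grouping, the language sites under https://domain/ contribute to a single robots.txt, while https://domain2/ gets its own.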

khalwat commented 3 years ago

Can you give me an actual example to look at?

What SEOmatic does, assuming you have "Site Groups define logically separate sites" turned on in SEOmatic -> Plugin Settings -> Advanced (which it is by default), is described here:

https://nystudio107.com/docs/seomatic/Technologies.html#multi-site-language-locale-support

You'll get one sitemap for each site in the site group, but it will have

<xhtml:link rel="alternate" hreflang="xx-xx">

...links to the same pages in other languages, assuming your entries are localized.
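To check which translations a sitemap actually advertises, the alternate links can be read out with a few lines of standard-library Python. This is a minimal sketch; the `SAMPLE` document and the `example.com` URLs are made up for illustration, and only the two XML namespaces are taken from the sitemap format itself.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Made-up sample sitemap with one URL and two language alternates.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/de</loc>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://example.com/de"/>
  </url>
</urlset>"""


def alternates(sitemap_xml):
    """Map each <loc> to a dict of {hreflang: href} taken from its alternate links."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    result = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        result[loc] = {
            link.get("hreflang"): link.get("href")
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
        }
    return result


found = alternates(SAMPLE)
```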

hiasl commented 3 years ago

1.) We have the feature "Site Groups define logically separate sites" turned on.
2.) Having different sitemaps is totally fine.
3.) Having different/multiple robots.txt in the same domain is wrong in my eyes, and this is the issue here.
4.) The main domain is https://www.concertvienna.com/, the first robots.txt is https://www.concertvienna.com/robots.txt, but there are more at https://www.concertvienna.com/de/robots.txt, https://www.concertvienna.com/fr/robots.txt, ...

The second domain I wrote about is not live yet, I just mentioned it for completeness.

khalwat commented 3 years ago

Can you articulate to me why you believe this is wrong? Here's the spec:

https://developers.google.com/search/docs/advanced/robots/robots_txt

The robots.txt for each domain notes which paths are disallowed for crawling, relative to the domain. Additionally, the sitemaps that each one links to are localized: they point to the direct URLs relative to that domain, as well as to the other translations of those pages.

For example, go here:

https://www.concertvienna.com/de/sitemaps-1-section-pagesCv-4-sitemap.xml

and choose "view source" to see more than the human-readable version of the sitemap, and you'll see all of the localizations listed:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://www.concertvienna.com/de</loc>
    <lastmod>2021-03-05T14:08:57+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/iStock-175563116.jpg</image:loc>
      <image:title>Vienna Opera</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/oper-wien</loc>
    <lastmod>2021-03-12T13:02:37+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/opera-vienna"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/opera-vienna"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/opera-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/oper-wien"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/opera-v-vene"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/wiener-staatsoper</loc>
    <lastmod>2021-03-12T11:38:14+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/vienna-state-opera"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/vienna-state-opera"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/opera-de-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/wiener-staatsoper"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/venskaya-opera"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/iStock-501071488.jpg</image:loc>
      <image:title>I Stock 501071488</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/volksoper-wien</loc>
    <lastmod>2021-03-12T11:38:47+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/vienna-volksoper"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/vienna-volksoper"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/opera-populaire-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/volksoper-wien"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/venskaya-narodnaya-opera"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/Presse_VolksoperDSC_6988.jpg</image:loc>
      <image:title>Presse Volksoper DSC 6988</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/ballett-wien</loc>
    <lastmod>2021-02-10T13:55:51+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/ballet-vienna"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/ballet-vienna"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/ballet-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/ballett-wien"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/balet-v-vene"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/spanische-hofreitschule</loc>
    <lastmod>2021-03-12T11:40:33+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/spanish-riding-school"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/spanish-riding-school"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/ecole-espagnole-equitation"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/spanische-hofreitschule"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/ispanskaya-shkola-verkhovoy-yezdy"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/csm_morning_exercise_c_Spanish_Riding_School_Julie_Brass_-_Kopie_eb1d6a3c99.jpg</image:loc>
      <image:title>Csm morning exercise c Spanish Riding School Julie Brass Kopie eb1d6a3c99</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/konzerte-wien</loc>
    <lastmod>2021-03-12T13:34:33+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/concerts-vienna"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/concerts-vienna"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/concerts-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/konzerte-wien"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/koncerty-v-vene"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/1432217879Orchestra_CU2_1600x1100_200404_223601.jpg</image:loc>
      <image:title>1432217879 Orchestra CU2 1600x1100</image:title>
    </image:image>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/datenschutzerklaerung</loc>
    <lastmod>2020-12-09T10:40:46+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/privacy"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/privacy"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/privacy"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/datenschutzerklaerung"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/privacy"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/imprint</loc>
    <lastmod>2020-12-09T10:40:31+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/legal"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/legal"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/legal"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/imprint"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/legal"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/kontakt</loc>
    <lastmod>2020-12-09T10:40:58+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/contact"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/contact"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/contact"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/kontakt"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/contacts"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/agbs</loc>
    <lastmod>2021-02-22T13:18:07+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/terms"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/terms"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/terms"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/agbs"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/terms"/>
  </url>
  <url>
    <loc>https://www.concertvienna.com/de/dinner-konzert-wien</loc>
    <lastmod>2021-03-12T11:42:44+01:00</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
    <xhtml:link rel="alternate" hreflang="x-default" href="https://www.concertvienna.com/dinner-concert-vienna"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.concertvienna.com/dinner-concert-vienna"/>
    <xhtml:link rel="alternate" hreflang="fr" href="https://www.concertvienna.com/fr/diner-concert-vienne"/>
    <xhtml:link rel="alternate" hreflang="de" href="https://www.concertvienna.com/de/dinner-konzert-wien"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://www.concertvienna.com/ru/uzhin-koncert"/>
    <image:image>
      <image:loc>http://www.concertvienna.com/user/images/concert-dinner-kursalon-vienna.jpg</image:loc>
      <image:title>Concert dinner kursalon vienna</image:title>
    </image:image>
  </url>
</urlset>

hiasl commented 3 years ago

The problem has nothing to do with sitemaps. All sitemaps and alternate links are totally ok.

The problem is that there are multiple robots.txt files. One domain should not have more than one robots.txt. The Google spec you cited confirms that. This is from https://developers.google.com/search/docs/advanced/robots/robots_txt:

http://example.com/folder/robots.txt | Not a valid robots.txt file. Crawlers don't check for robots.txt files in subdirectories.

Since the different languages within the site group are all served by the same domain "www.concertvienna.com", there should only be one "www.concertvienna.com/robots.txt" and not one for each language www.concertvienna.com/de/robots.txt, www.concertvienna.com/fr/robots.txt, ...
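The rule quoted from the spec can be expressed as a small helper: whatever page a crawler is on, the robots.txt it consults sits at the root of that scheme and host. A minimal sketch, with a hypothetical function name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the only robots.txt location a crawler checks for this URL's host."""
    parts = urlsplit(page_url)
    # Path, query and fragment are discarded; only scheme and host matter.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


root_robots = robots_url_for("https://www.concertvienna.com/de/oper-wien")
```

Every page under /de/ or /fr/ resolves to the same https://www.concertvienna.com/robots.txt, so a file at /de/robots.txt is never consulted.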

hiasl commented 3 years ago

And to make it more specific again: this ONE robots.txt should then link all primary language sitemaps, not only the one in the language you are currently in.

khalwat commented 3 years ago

I'm not seeing any actual negative impact from this, though. What it's saying is just that bots will not look for /robots.txt anywhere but in the root domain. So it'll use https://www.concertvienna.com/robots.txt in your example, and just never find the others.

And since the sitemap it links to properly handles multiple languages, we should be good there too.

However, it's true that this is vestigial for sites that are localized via sub-directories and not domains. Given that you can set the site to any domain you want, it sounds like what we're looking for here is:

Would that do it for you?

hiasl commented 3 years ago

Yes, I think the second point is the important one, the first one makes the solution cleaner though. The important thing is to additionally link to the primary sitemap of each language/multisite containing a path, in the same domain.

khalwat commented 3 years ago

Okay. Also keep in mind that SEOmatic automatically sends the sitemaps to Google and Bing, as you should see in your web console, and each sitemap has hreflang links to translations in other languages to index.

So I think we're actually talking about minimal to no impact here, in terms of bots not discovering these sitemaps appropriately.

hiasl commented 3 years ago

I agree, the impact might be minimal, but the current implementation is not perfect. An SEO agency confronted me with the issue; I had not even noticed it. They put it on the list of things to improve. So this is where I am now: the customer just has the info that there is something wrong here. Of course I would be happy if this could be improved, so that I can mark this issue as solved with the customer.

Please just let me know if you're going to look into it or not. Thanks!

khalwat commented 3 years ago

Definitely going to address it!

khalwat commented 3 years ago

This has been addressed in the above commits.

You can try it now by setting the version constraint in your composer.json to look like this:

    "nystudio107/craft-seomatic": "dev-develop as 3.3.36",

Then do a composer update

hiasl commented 3 years ago

I installed the dev version as 3.3.36, but only half of it works for me:

I did clear all SEOmatic caches. I also recreated all sitemaps, just in case this is also needed for robots.txt

khalwat commented 3 years ago

This is due to a cached template, which should have been propagated, but for some reason was not.

Can you look at the seomatic_metabundles database table, and tell me what the version numbers of the rows with __GLOBAL_BUNDLE__ are?

hiasl commented 3 years ago

I only have __GLOBAL_BUNDLE__ rows (missing META), and if you mean the bundleVersion column, this is 1.0.47 for those.

khalwat commented 3 years ago

None of them were updated beyond 1.0.47? It should lazily update the meta bundles, so clearing caches and then visiting the pages/sites should cause the meta bundles to update.

hiasl commented 3 years ago

[Screenshot 2021-03-24 at 12:50:10]

hiasl commented 3 years ago

which caches should I clear? only seomatic or all?

khalwat commented 3 years ago

Clear the SEOmatic caches, then on the frontend of the website, visit a page of each Site and it should lazily update the bundles.

I will verify on my end as well.

hiasl commented 3 years ago

I did; nothing changed. I cleared the SEOmatic caches and visited 4 different languages, but bundleVersion is still 1.0.47. I assume by "lazily" you mean immediately after I visit the frontend pages, not some time later, right?

khalwat commented 3 years ago

Yes. I will verify on my end as well. This is the fix commit: https://github.com/nystudio107/craft-seomatic/commit/332b67a66c2bac4f6398876514960fadc2ec7de2

khalwat commented 3 years ago

@hiasl confirmed there was a regression that could cause the metabundles to not update; fixed in https://github.com/nystudio107/craft-seomatic/commit/5a77c461d511fb9c473624de0b4e3a4aa52d2c76

You can try it now by setting the version constraint in your composer.json to look like this:

    "nystudio107/craft-seomatic": "dev-develop as 3.3.36",

Then do a composer clear-cache && composer update

Here's what it looks like for me in local dev:

# robots.txt for http://localhost:8000/

sitemap: http://localhost:8000/sitemaps-1-sitemap.xml
sitemap: http://localhost:8000/es/sitemaps-1-sitemap.xml

# local - disallow all

User-agent: *
Disallow: /

hiasl commented 3 years ago

robots.txt still stays the same (only one sitemap), but bundleVersion does increase now.

What I did:

These are my site settings: [Screenshot 2021-03-24 at 15:25:58]

khalwat commented 3 years ago

Okay so I went down the rabbit hole to figure out what was going wrong here, and it turns out this is expected behavior... but I'm open to discussion about whether this is good behavior or not.

So the frontend templates like robots, humans, ads, etc. allow you to edit them in the CP. In our case, under SEOmatic -> Global SEO -> Robots

When updating the meta containers, it preserves any data that is user-editable. If we didn't do this, then someone who had customized their Robots.txt would have it blown away by the update.
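The behavior described here explains why bundleVersion moved while robots.txt did not. A minimal sketch of such a merge follows, assuming a bundle is a flat dictionary. The field name `templateString`, the set `USER_EDITABLE`, and the version `1.0.48` are hypothetical and do not reflect SEOmatic's actual schema.

```python
# Hypothetical set of fields a user can edit in the control panel.
USER_EDITABLE = {"templateString"}


def merge_bundle(stored, shipped):
    """Take everything from the shipped bundle except fields the user may have edited."""
    merged = dict(shipped)
    for key in USER_EDITABLE:
        if key in stored:
            # Keep the stored value so customizations are not overwritten.
            merged[key] = stored[key]
    return merged


stored = {"bundleVersion": "1.0.47", "templateString": "old robots template"}
shipped = {"bundleVersion": "1.0.48", "templateString": "new robots template"}
merged = merge_bundle(stored, shipped)
```

The version advances, but the stored template survives, which matches what was observed above.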

So in order for the fix to fully propagate here on existing sites, you'd need to manually paste the contents of craft-seomatic/src/templates/_frontend/pages/robots.twig into your Global settings for each site. Here it is:

# robots.txt for {{ siteUrl }}

{{ seomatic.helper.siteGroupSitemaps() }}
{% switch seomatic.config.environment %}

{% case "live" %}

# live - don't allow web crawlers to index cpresources/ or vendor/

User-agent: *
Disallow: /cpresources/
Disallow: /vendor/
Disallow: /.env
Disallow: /cache/

{% case "staging" %}

# staging - disallow all

User-agent: *
Disallow: /

{% case "local" %}

# local - disallow all

User-agent: *
Disallow: /

{% default %}

# default - don't allow web crawlers to index cpresources/ or vendor/

User-agent: *
Disallow: /cpresources/
Disallow: /vendor/
Disallow: /.env
Disallow: /cache/

{% endswitch %}

I'm open to input on how to handle this. On the one hand, not having it just automatically update to fix the issue is confusing. On the other hand, blowing away data that the user may have customized is likely worse.

hiasl commented 3 years ago

I can confirm this is now working with {{ seomatic.helper.siteGroupSitemaps() }} Thanks a lot!

One little thing: {{ seomatic.helper.siteGroupSitemaps() }} outputs the leading word "sitemap:" in lowercase. I do not know if search engines are picky about that, but it's normally written with a capital "S" at the beginning of "Sitemap".

Regarding the propagation of this fix, I have 2 ideas: 1.) My favourite: Why don't you deprecate {{ seomatic.helper.sitemapIndexForSiteId() }} and {{ seomatic.helper.siteGroupSitemaps() }} and just call it {{ seomatic.helper.sitemapIndex() }}.

This new method outputs ALL sitemaps within the same domain, ignoring any additional paths. This works even across site groups which is totally ok since there can always only be one robots.txt per domain. This would cover all cases:

For this idea you could even content migrate the robots.txt templates and replace Sitemap: {{ seomatic.helper.sitemapIndexForSiteId() }} with {{ seomatic.helper.sitemapIndex() }}. And I guess it will not have any negative effects on existing installations.

2.) A less invasive idea would be to content migrate the robots.txt template field in SEOmatic's global settings and replace {{ seomatic.helper.sitemapIndexForSiteId() }} with

{{ seomatic.helper.sitemapIndexForSiteId() }} 
{# use this for multisites in single domains {{ seomatic.helper.siteGroupSitemaps() }} #}

if {{ seomatic.helper.siteGroupSitemaps() }} is not part of the field yet... But not sure if this is really good.
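The first idea amounts to a guarded string replacement on the stored template. The sketch below uses the two template snippets named in this comment; the function name is hypothetical, and this is not how SEOmatic performs migrations.

```python
# Both strings are taken from the proposal above.
OLD_LINE = "Sitemap: {{ seomatic.helper.sitemapIndexForSiteId() }}"
NEW_LINE = "{{ seomatic.helper.sitemapIndex() }}"


def migrate_robots_template(template):
    """Swap the old sitemap line for the new helper, leaving other content untouched."""
    if NEW_LINE in template or OLD_LINE not in template:
        # Already migrated, or customized in a way we should not touch.
        return template
    return template.replace(OLD_LINE, NEW_LINE)


before = "# robots.txt\n\n" + OLD_LINE + "\n\nUser-agent: *\nDisallow: /\n"
after = migrate_robots_template(before)
```

Running the function a second time returns the template unchanged, so the migration is safe to repeat.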

khalwat commented 3 years ago

Yeah I checked the spec, and lowercase is actually what they list, but either is fine.

I like your ideas for the migration, the thing that's bothering me is there's no real place for it currently. I'd need to special-case for this particular template, which feels a little gross.

I'll see if I can't come up with something more general, and in the meantime, a manual update isn't the end of the world.

khalwat commented 3 years ago

Found a decent vector:

seomatic.helper.siteGroupSitemaps() -> seomatic.helper.sitemapIndex() -> https://github.com/nystudio107/craft-seomatic/commit/2e80b7f35435a02cc5b4298f4e5a20e4c37e61b5

Swap in the new robots.txt sitemaps -> https://github.com/nystudio107/craft-seomatic/commit/3059c8c6da2c1169573e1a396dfe9fb3ab27b905

OwenMelbz commented 3 years ago

FYI I was reading the spec you sent over https://developers.google.com/search/docs/advanced/robots/robots_txt

And noticed it says robots.txt files in subfolders are not valid.

That would suggest that there should only be the very first .com/robots.txt ??

khalwat commented 3 years ago

@OwenMelbz It's not doing this anymore, please see the changes above, which are live.

It's also a very minor issue, given that the sitemaps were all available from the root, with proper hreflang links, and the individual sitemaps were submitted to Google/Bing automatically regardless.

OwenMelbz commented 3 years ago

@khalwat I'd updated our robots.txt config with the new stub from github, then cleared all the caches - but we're still getting a 404 for /robots.txt since updating e.g. https://www.fsifm.com/robots.txt Any thoughts?

khalwat commented 3 years ago

Likely to be a server-side configuration issue. Track it down in the logs.

The 404 is unlikely to be related to this issue.

OwenMelbz commented 3 years ago

There are no errors in the logs at all. Doesn't SEOmatic register the route?

Like it does for the sitemaps, which work.

This also worked before the update, no server configurations have changed. Just a composer install and project config/apply

khalwat commented 3 years ago

Probably a new issue should be filed for this, but assuming nothing else has changed other than the update to SEOmatic (and it worked before), then it's likely failing here:

https://github.com/nystudio107/craft-seomatic/blob/v3/src/services/FrontendTemplates.php#L90

So ensure that your site:

1 - has a Base URL set
2 - has a Base URL that does not have a sub-directory as part of it
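Both conditions can be checked quickly against a site's Base URL. This is a rough sketch of the two checks listed above, not the code at the linked line in FrontendTemplates.php; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def base_url_allows_root_robots(base_url):
    """True if a Base URL is set and its path is empty or just '/'."""
    if not base_url:
        # Condition 1: a Base URL must be set.
        return False
    # Condition 2: no sub-directory in the Base URL.
    return urlsplit(base_url).path in ("", "/")


ok_root = base_url_allows_root_robots("https://www.concertvienna.com/")
ok_subdir = base_url_allows_root_robots("https://www.concertvienna.com/de/")
```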