nystudio107 / craft-seomatic

SEOmatic facilitates modern SEO best practices & implementation for Craft CMS 3. It is a turnkey SEO system that is comprehensive, powerful, and flexible.
https://nystudio107.com/plugins/seomatic

Sitemap with 503 error in tools like Semrush #1213

Closed: danfathom closed this issue 2 years ago

danfathom commented 2 years ago

Describe the bug

We are getting the exact same error on all our sites using SEOmatic, as described in this previously closed issue: https://github.com/nystudio107/craft-seomatic/issues/1143

We have updated Craft and SEOmatic multiple times in the past, and the issue still persists.

To confirm, I'll explain below:

We have a number of sites now, different servers, where different audit tools, like Semrush, flag the sitemap link with a 503 error.

Have you any ideas to fix this?

Screenshots

I've hidden the website domain in the following screenshot. [Screenshot: Screen-Shot-2022-09-26-at-12 34 05]

Versions

khalwat commented 2 years ago

SEOmatic only generates a 503 for sitemaps that have not been generated yet; after that, it caches the result, and it is returned with a 200 status code.

The 503 that it returns includes the Google-recommended Retry-After set to 60 minutes:

https://developers.google.com/search/blog/2011/01/how-to-deal-with-planned-site-downtime

Are you having issues on your site in terms of queue jobs running, which is how the sitemaps are built?

I'd also like to see a live URL to look at, so I can see the 503 error happening in the wild. Can you provide one?
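A quick way to observe this behavior from the command line is to fetch only the response headers for a sitemap URL. This is a generic sketch (not part of SEOmatic); the URL is a placeholder to substitute with your own sitemap URL.

```shell
# Hypothetical helper: print the status line and any Retry-After header
# for a sitemap URL, so you can see whether a 503 (and its retry hint)
# is being returned at the moment you check.
check_sitemap_headers() {
  curl -sI "$1" | grep -iE '^(HTTP/|retry-after)'
}

# Example (placeholder URL):
# check_sitemap_headers "https://example.com/sitemaps-1-sitemap.xml"
```

If the sitemap is still being generated, you would expect to see a 503 status line together with a `Retry-After` header; once cached, a plain 200.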

danfathom commented 2 years ago

Hi,

We don't have any issues in terms of queue jobs running as far as I can see. I've just double checked and the Queue manager has no pending jobs. I've also checked the queue.log file, and there aren't any errors in there either.

Sure, the site is https://albertgoodman.co.uk/

Please let me know how you get on and if there is a fix for this.

Many thanks, Dan

khalwat commented 2 years ago

I'm not sure there is anything to be fixed; I'm checking your sitemaps, and I'm not seeing a 503 for any of them:

https://albertgoodman.co.uk/sitemaps-1-sitemap.xml

None of the URLs listed in the report you've provided are returning a 503 for me:

https://albertgoodman.co.uk/sitemaps-1-section-agInsolvency-1-sitemap.xml

https://albertgoodman.co.uk/sitemaps-1-section-agPayroll-1-sitemap.xml

Are you sure this isn't an issue with Semrush, or URLs that need to be re-crawled?

khalwat commented 2 years ago

Have you tried checking in your Google Search Console to see if it is having any issues reading these sitemaps?

Have you tried any tools other than SEMrush to see if they also are reporting the same 503?

jwmatlock commented 2 years ago

@khalwat I've had pretty much the same experience as @danfathom with regard to Semrush's findings. At first I conceded that it was a Semrush/bot flaw that was returning the 503, but then I was able to catch the 503 in the act: I saw it in my browser. This happens for sitemaps that have been created, but then appear to be in the process of being recreated.

khalwat commented 2 years ago

@jwmatlock That sounds like correct behavior then, I'd think? In other words, SEOmatic is serving up the old sitemap, but it's letting Google et al know that a new version will be available, and check back in 60 minutes.

I'll delve into it to see if a better result code could be returned in this specific instance, where the sitemap is being served from the cache, but also a new sitemap is being generated behind the scenes.

Unless you're seeing issues in your Google Search Console, however, I don't think it's problematic behavior.

khalwat commented 2 years ago

So I searched the entire codebase, and there is only one place that it will return a 503 for a sitemap, and that is here:

https://github.com/nystudio107/craft-seomatic/blob/develop/src/models/SitemapTemplate.php#L212

It only ever gets to there if the sitemap has been invalidated due to someone changing something in the CP, which then spawns a queue job to regenerate the sitemap:

https://github.com/nystudio107/craft-seomatic/blob/develop/src/services/Sitemaps.php#L513

So SEOmatic should never serve up a rendered sitemap with a 503 result code.

In looking at it, though, I can see room for improvement... it should be using a "stale while revalidate" pattern, rather than invalidating the cache and then regenerating the sitemap.

khalwat commented 2 years ago

So I'm thinking we remove these lines:

        TagDependency::invalidate($cache, SitemapTemplate::SITEMAP_CACHE_TAG . $handle . $siteId);
        Craft::info(
            'Sitemap cache cleared: ' . $handle,
            __METHOD__
        );

https://github.com/nystudio107/craft-seomatic/blob/develop/src/services/Sitemaps.php#L517

...and just never invalidate the cache, because we overwrite anything that might be cached once the regenerated sitemap is rendered:

https://github.com/nystudio107/craft-seomatic/blob/develop/src/helpers/Sitemap.php#L364

...this will give us the "stale while revalidate" cache we're looking for, which will eliminate the window during which a 503 could be returned.

I'm still okay with it returning a 503 with Retry-After set, as this is what Google recommends -- but this will reduce the window in which it could ever happen.

Worst-case, the old sitemap will be served up while the new one is regenerating.
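As an illustration only (not SEOmatic's actual PHP), the stale-while-revalidate idea described above can be sketched with a file-based cache: always serve whatever is cached, and have the regeneration job atomically overwrite the cache file when it finishes, so readers never observe a missing entry mid-regeneration. All names here are hypothetical.

```shell
# Illustrative sketch of stale-while-revalidate with a file cache;
# none of these function names exist in SEOmatic itself.
serve_sitemap() {
  if [ -f "$1" ]; then
    cat "$1"    # serve the (possibly stale) cached copy
  else
    # only reachable before the very first generation
    echo "503 Service Unavailable (Retry-After: 3600)"
  fi
}

regenerate_sitemap() {
  # write to a temp file, then atomically swap it in; readers never
  # see a half-written or deleted cache entry
  printf '%s\n' "$2" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

The key design point is that the old copy is never deleted first; it is only ever replaced, which closes the window in which a 503 could be served.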

danfathom commented 2 years ago

Hi,

Thanks for looking into this. Sounds like a good solution.

Could you let me know if you plan to bring out an update with this fix, and when it will be available?

Many thanks, Dan

khalwat commented 2 years ago

@danfathom I'm working on it -- but I want to be clear that other than the reports you're seeing in SEMrush, I don't think this affects your site in any way. Google recommends this pattern when you have a page that GoogleBot should check back on in a given time period.

Can you check in your Google Search Console, and verify that the sitemaps are listed as properly digested there?

danfathom commented 2 years ago

I can confirm the sitemap shows fine in search console.

Up to now we've said this is an issue with SEMrush, but we send our clients monthly SEO reports, and unfortunately this brings their SEO score down. We always get asked if we can fix it, so it would be great to have it resolved.

I appreciate you looking into it for us.

khalwat commented 2 years ago

Addressed in: https://github.com/nystudio107/craft-seomatic/commit/d7ab612044ba7cbd011e273ae31bfcb2d0231050 & https://github.com/nystudio107/craft-seomatic/commit/309fb39907876169a5215208abce4a7526c81a98

Craft CMS 3:

You can try it now by setting your semver in your composer.json to look like this:

    "nystudio107/craft-seomatic": "dev-develop as 3.4.39”,

Then do a composer clear-cache && composer update


Craft CMS 4:

You can try it now by setting your semver in your composer.json to look like this:

    "nystudio107/craft-seomatic": "dev-develop-v4 as 4.0.9”,

Then do a composer clear-cache && composer update

danfathom commented 2 years ago

Brilliant, thank you! I'll give that a try!

danfathom commented 2 years ago

Unfortunately it looks like the 3.4.39 version has not fixed the issue, as the sitemap error has come back again in Semrush (see screenshot below).

[Screenshot: screencapture-semrush-siteaudit-campaign-4311837-review-2022-10-31-14_02_52]

I can confirm we are on Craft 3.7.57, and SEOmatic 3.4.39.

khalwat commented 2 years ago

Feel free to reach out to me to do a screen share if you like; something isn't right here, imo.

danfathom commented 1 year ago

That would be great, thanks. Are you free today? I'm available to jump on a call and show you the errors we are getting. I'm free for the next 6 hours.

khalwat commented 1 year ago

Unfortunately I'm not free today, but you can book time next week here:

https://savvycal.com/nystudio107/chat

danfathom commented 1 year ago

Brilliant, thanks. I've scheduled one for Friday afternoon.

danfathom commented 1 year ago

@khalwat I'm currently on the call we scheduled, are you going to join?

khalwat commented 1 year ago

Apologies Dan, I just sent you an email about rescheduling it. I'm sorry I missed this meeting.

khalwat commented 1 year ago

You may want to set up a proper queue job runner:

https://nystudio107.com/blog/robust-queue-job-handling-in-craft-cms

danfathom commented 1 year ago

Hi,

Following our call from before Christmas, we discussed adding some cron jobs and contacting Semrush about obeying the 'Retry-After' header.

I have contacted Semrush with no luck. Unfortunately they have not been very helpful in this case; they said they would look into it, and I haven't heard anything back despite chasing them a few times.

I have added two cron jobs: one to run Craft's queue every 30 minutes, and another every day at 4am to generate SEOmatic's sitemaps.
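For reference, crontab entries matching that schedule might look like the following. This is a hedged sketch: the PHP binary path and project path are placeholders for your own server.

```
# Run Craft's queue every 30 minutes (hypothetical paths)
*/30 * * * * /usr/bin/php /var/www/mysite/craft queue/run >/dev/null 2>&1

# Regenerate SEOmatic's sitemaps daily at 4am
0 4 * * * /usr/bin/php /var/www/mysite/craft seomatic/sitemap/generate >/dev/null 2>&1
```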

We are still getting the sitemap errors in Semrush. I don't suppose you have any other ideas we could try?

Thanks!

khalwat commented 1 year ago

It's really hard for me to fix a problem that I'm unable to reproduce.

If you're able to reproduce the 503 yourself (not via SEMrush) and can tell me how I can reproduce it here, I'd love to get this solved.

I've done code review, and especially with the stale while revalidate changes, I'm not seeing how it can be happening.

The only other wildcard is your actual site and the devops used on it.

danfathom commented 1 year ago

I understand it's tricky when we are unable to replicate the issue.

If I'm able to replicate the 503 error I will let you know, but in the meantime we will ignore the sitemap errors from Semrush.

Thanks for your help

estebancastro commented 1 year ago

I've had the same problem for over a year (https://imgur.com/xkAHUQB); updates don't solve it.

SEOmatic Sitemaps settings: https://imgur.com/cpnwn6T

Craft and plugins are updated to the latest versions.

Laravel Forge Deploy Script:

# Site folder
cd /home/forge/mysite.com

# Fetch the newest state
git pull origin $FORGE_SITE_BRANCH

# Install all node prod dependencies
npm install

# Install all composer prod dependencies with an optimized autoloader
$FORGE_COMPOSER install --no-dev --no-interaction --prefer-dist --optimize-autoloader

# Runs pending migrations and applies pending project config changes
$FORGE_PHP craft up

# Run Laravel mix
npx mix --production

# Clear cache
$FORGE_PHP craft clear-caches/all
$FORGE_PHP craft blitz/cache/refresh

# Generate Sitemaps
# $FORGE_PHP craft seomatic/sitemap/generate

# Restart the FPM service
( flock -w 10 9 || exit 1
    echo 'Restarting FPM...'; sudo -S service $FORGE_PHP_FPM reload ) 9>/tmp/fpmlock

Note: maybe silly, but as you can see above, one of my attempts to fix this was to comment out/disable craft seomatic/sitemap/generate to avoid regenerating sitemaps after each deploy, expecting the sitemaps to stay unchanged longer, but it didn't make any difference.

If I visit https://mysite.com/sitemaps-1-sitemap.xml right now, I don't get a 503. It's hard to catch the 503 because, as you can see in the Semrush screenshot, it happens on different days and at different hours.

I remember finding some 503s in the logs, something like:

[04/Oct/2022:22:37:15 +0000] "GET /es/sitemaps-1-section-activitiesPage-2-sitemap.xml HTTP/1.1" 503 392 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

[02/Oct/2022:17:40:26 +0000] "GET /robots.txt HTTP/1.1" 503 162 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
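Log lines like these can be tallied with a short awk sketch, to see which URLs are returning 503s and how often. This assumes a standard combined access-log format (status code in field 9, request path in field 7); adjust the field numbers for your server's log format.

```shell
# Hypothetical helper: count 503 responses per URL in a
# combined-format access log, most frequent first.
count_503s() {
  awk '$9 == 503 { print $7 }' "$1" | sort | uniq -c | sort -rn
}

# Example: count_503s /var/log/nginx/access.log
```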

Any solution? This is the only issue related to this site's SEO/SEOmatic that I still can't fix.

Thank You!

khalwat commented 1 year ago

I'm not convinced that this is an issue with SEOmatic; it is complying with what Google says should be done if the sitemap isn't available: returning a 503 with a Retry-After header:

https://github.com/nystudio107/craft-seomatic/blob/develop-v4/src/models/SitemapTemplate.php#L214

SEOmatic is using a stale-while-revalidate pattern, so it should be very rare that a 503 is actually encountered. But when it does happen, we're returning the result code that Google says we should:

https://developers.google.com/search/blog/2011/01/how-to-deal-with-planned-site-downtime

So it seems to me like the SEMrush bot is hanging onto this error result code when it shouldn't; instead it should obey the 'Retry-After' header, retry the URL if it runs into an instance where the sitemap isn't generated, and then clear the error code.

If you check your Google Search Console, you will not find any errors with the sitemaps, which reinforces my thought that this is specific to SEMrush.

khalwat commented 1 year ago

So I've been told that the issue here is that something is clearing the caches every 2-3 days.

SEOmatic does not clear its caches unless explicitly told to... so anyone running into this, what caching method do you have Craft CMS set to use?

Is there any reason you can discern why your caches would be cleared on a regular or semi-regular basis?

danfathom commented 1 year ago

We use Gzip and have the following rules set up in our .htaccess. I don't know if this would have any effect?

[Screenshot: Screenshot 2023-07-21 at 09 30 50]

khalwat commented 1 year ago

No, when I'm discussing caching, I mean at the Craft level. The above is just the web server config.

https://servd.host/blog/caching-craft-cms

The other thing that has come up in our discussions is that it's possible queue jobs are failing.

When an entry is saved, SEOmatic will invalidate the appropriate sitemap caches, and push a new queue job to generate them.

If that queue job fails, then it won't properly regenerate the sitemap cache, and the next thing that requests it (a SEMrush bot for instance) will get a 503.

So if that's what is happening, we need to find out why the queue jobs are failing to run. Most likely it is due to a devops/setup issue. I'd recommend having a look at:

https://nystudio107.com/blog/robust-queue-job-handling-in-craft-cms

and also:

https://craftcms.com/docs/4.x/queue.html

khalwat commented 1 year ago

Also if queue jobs are failing, check your log files: queue.log & possibly console.log and web.log to find out why.

We need information on why queue jobs are failing to complete, if this is indeed what is going wrong here.
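One generic way to skim those logs for failures is a small grep wrapper. This is a sketch, not an official Craft tool; the log path and the exact error marker vary by Craft version and install, so adjust the pattern to your log format.

```shell
# Hypothetical helper: print the last few error lines from a Craft
# log file (Craft 3 Yii-style logs tag failures with "[error]").
show_log_errors() {
  grep -hE '\[error\]' "$1" 2>/dev/null | tail -n 20
}

# Example: show_log_errors storage/logs/queue.log
```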

estebancastro commented 1 year ago

About the cache on my site:

1. Forge deploy script; this runs after each Craft/plugin update or any production change pushed to the repository:

        # Clear cache
        $FORGE_PHP craft clear-caches/all
        $FORGE_PHP craft blitz/cache/refresh

    Note: I disabled this after August 1, 2023 because of this Semrush issue. Maybe it helps to stop clearing the cache after each Craft/plugin/repo update? We'll see if Semrush continues reporting the issue.

2. I use the Blitz plugin; these are the settings related to clearing the cache: https://imgur.com/a/j2MFPMQ

3. Cache-clearing history (basically the cache has been cleared after each Craft/plugin update): Jul 19, 2023; Jul 16, 2023; Jun 17, 2023; Jun 5, 2023; Jun 4, 2023; May 24, 2023; May 22, 2023; May 16, 2023; May 8, 2023

4. queue.log: https://mega.nz/file/SYglwJbT#zyBdhxFZLQeHCR5XL8-usjTu6-XLHNOXaOt-e4OfiTE

Maybe that info helps?

khalwat commented 1 year ago

FWIW, I clear all caches when I deploy to my sites.