merzilla / inm_googlesitemap

TYPO3 CMS extension. Generates a sitemap.xml via Scheduler Task by crawling the given URL (recursively) from frontend point of view. It uses PHPCrawl library.

obey robots.txt is not working #5

Open notacoder-ui opened 3 years ago

notacoder-ui commented 3 years ago

Hi,

I have set these rules in robots.txt:

Disallow: /mailto:%20iasdf%66o%40r%65%69asdfdf%2ede
Disallow: /news-letter/unsub

I then started the cron job to build the sitemap, but the job still indexed both of the URLs above.

How can I keep certain URLs from being indexed in the sitemap?
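For reference, here is a minimal sketch of how a robots.txt parser would be expected to treat a rule like the second one above. It uses Python's standard `urllib.robotparser`, not the extension's own PHP code, so it only illustrates the expected behaviour. The `User-agent: *` line, the `www.example.org` host, and the helper name `allowed_urls` are assumptions made for the example; the original post shows neither the user-agent group nor the full robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the Disallow rule sits under a catch-all "User-agent: *" group.
# The original post does not show the User-agent line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /news-letter/unsub
"""


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return only the URLs that the given robots.txt text permits."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Hypothetical crawl result; the host name is a placeholder.
    crawled = [
        "https://www.example.org/",
        "https://www.example.org/news-letter/unsub",
        "https://www.example.org/news-letter/unsub/abc",
    ]
    for url in allowed_urls(crawled, ROBOTS_TXT):
        print(url)
```

Disallow rules are prefix matches on the URL path, so both the `/news-letter/unsub` page and anything below it would be dropped by a crawler that honours the file.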

notacoder-ui commented 3 years ago

I also got this in my TYPO3 error log:

Mon, 21 Dec 2020 05:56:00 +0000 [ERROR] request="e909aaa824fb3" component="INM.InmGooglesitemap.Generators.SitemapGenerator": Extension inm_googlesitemap: Error Code: 5 --- Reason: Socket-stream timed out (timeout set to 5 sec).

This error caused the site to return a 503 error; restarting the php-fpm service brought the site back.

Please check this too.

merzilla commented 3 years ago

Hi @notacoder-ui, it may be that other rules override the rules from your robots.txt. Please provide more information: TYPO3 version, PHP version, etc. The settings you made in the Scheduler task are of course important as well.

notacoder-ui commented 3 years ago

Okay.

TYPO3 version: 9.5.19
PHP version: 7.2.34
Settings in Scheduler:

settings

merzilla commented 3 years ago

Okay, well, adding mailto to regexDirectoryExclude will not help you; that would only exclude something like https://foo.tld/mailto/something. But you can add news-letter there instead of mailto to exclude that path. You can also shorten linkExtractionTags: update this field so that it contains only href. I know, mailto links are also in an href, but let me know if it's better now. If not, I will have to check why mailto is not omitted by default.
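A rough illustration of the point about directory exclusion, as a Python sketch. It assumes that an excluded directory name is matched as a complete leading path segment followed by `/`; how the extension actually builds its pattern from regexDirectoryExclude is not shown in this thread, and the helper names `directory_exclude_pattern` and `is_excluded` are hypothetical.

```python
import re
from urllib.parse import urlparse


def directory_exclude_pattern(name: str) -> "re.Pattern[str]":
    # Assumption: a directory name only matches as a full path segment
    # at the start of the path, followed by a slash.
    return re.compile(r"^/" + re.escape(name) + r"/")


def is_excluded(url: str, directory_names) -> bool:
    """True if the URL path starts with one of the excluded directories."""
    path = urlparse(url).path
    return any(directory_exclude_pattern(n).search(path) for n in directory_names)


if __name__ == "__main__":
    # Hypothetical URLs modelled on the ones discussed in this issue.
    samples = [
        "https://foo.tld/mailto/something",
        "https://www.xyz.de/mailto:%20info%40example.de",
        "https://www.xyz.de/news-letter/unsub",
    ]
    for url in samples:
        print(url, is_excluded(url, ["mailto", "news-letter"]))
```

Under that assumption, `/mailto:%20...` is not a directory named `mailto`, because the segment continues with `:` rather than ending in `/`, which is why the news-letter entry helps while the mailto entry does not.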

notacoder-ui commented 3 years ago

Hi @merzilla

I updated the settings as you said and ran the cron job. The site went into 503 mode, and I found this in the error log:

Tue, 22 Dec 2020 05:05:01 +0000 [ERROR] request="08db8edc7ac5b" component="INM.InmGooglesitemap.Generators.SitemapGenerator": Extension inm_googlesitemap: Response Header not correct. Got HTTP Status Code 302 for URL https://www.xyz.de/mailto:%20%69n%66%6f%40%72eise%6cinie%2e%64e --- Complete Response Header:
HTTP/1.1 302 Found
Date: Tue, 22 Dec 2020 05:05:01 GMT
Server: Apache
X-Powered-By: PHP/7.2.34
location: /404fehler
X-Powered-By: PleskLin
X-UA-Compatible: IE=edge
X-Content-Type-Options: nosniff
Cache-Control: public, no-transform, must-revalidate
Last-modified: Mon, 14 Dec 2020 10:10:10 GMT
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

notacoder-ui commented 3 years ago

Hi @merzilla

I need to configure the task so that certain links are not indexed when the sitemap.xml file is generated. Is there a setting that lets me exclude such URLs?

Alternatively, a working "obey robots.txt" option would also be fine for me: I could list the URLs there with Disallow so that they are not indexed when a new sitemap.xml is generated.