hlxsites / servicenow


Configure robots.txt for engine crawling #250

Closed andreituicu closed 4 months ago

andreituicu commented 4 months ago

Problems

What Google currently sees: https://search.google.com/test/rich-results/result?id=5-WxbiwaRUA7GqzuPx808w

Problem 1: ServiceNow has the following entry in their robots.txt (https://www.servicenow.com/robots.txt):

Disallow: /*?

This prevents robots from crawling images, because EDS serves media using the following URL pattern: media_${hash}.${extension}?width=${size}&format=${format}&optimize=medium

To fix this, the customer should add the following entry to their robots.txt:

Allow: */media_*?*

This Disallow rule also affects API calls (like index lookups, tags, etc.).
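
For context, here is a minimal sketch of how RFC 9309-style matchers (Googlebot and most major crawlers) resolve these rules: the `*` and `$` wildcards are honored, the longest matching pattern wins, and Allow wins ties. This is an illustration written for this issue, not any crawler's actual code, and the hash and query values below are made up.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as "match anything".
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*")
                      + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """rules: (directive, pattern) pairs, e.g. ('Disallow', '/*?').
    Longest matching pattern wins; 'Allow' wins ties; no match => allowed."""
    best = None  # (pattern_length, directive)
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(url_path):
            length = len(pattern)
            if best is None or length > best[0] or \
               (length == best[0] and directive == "Allow"):
                best = (length, directive)
    return best is None or best[1] == "Allow"

before = [("Disallow", "/*?")]
print(is_allowed(before, "/media_1a2b3c.png?width=750&format=webply"))  # False: image blocked
print(is_allowed(before, "/blogs/query-index.json?limit=500"))          # False: API call blocked

after = before + [("Allow", "*/media_*?*"),
                  ("Allow", "/blogs/query-index.json?*")]
print(is_allowed(after, "/media_1a2b3c.png?width=750&format=webply"))   # True: longer Allow wins
print(is_allowed(after, "/blogs/query-index.json?limit=500"))           # True
```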

Problem 2: We were too aggressive with the blog-specific Disallow rules. Fragments should be allowed to be crawled, even though they must not be indexed: a crawler can only discover a fragment's noindex signal (e.g. page metadata) if it is allowed to fetch the fragment in the first place.

Solution

In summary, the following consolidated changes need to be made to robots.txt:

REMOVE:

Disallow: /blogs/fragments/*

Disallow: /uk/blogs/fragments/*

Disallow: /fr/blogs/fragments/*

Disallow: /de/blogs/fragments/*

Disallow: /nl/blogs/fragments/*

ADD:

Allow: */media_*?*
Allow: /blogs/query-index.json?*
Allow: /blogs/tags.json?*

Note: the Disallow rules for the drafts folders must still remain in place; see the check below.
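
As a sanity check, the consolidated rule set can be run through the `is_allowed` helper from the sketch above. The `/drafts/*` pattern and all sample URLs here are hypothetical stand-ins, not the actual entries in ServiceNow's robots.txt:

```python
consolidated = [
    ("Disallow", "/*?"),
    ("Disallow", "/drafts/*"),  # stand-in for the existing drafts rules, kept in place
    ("Allow", "*/media_*?*"),
    ("Allow", "/blogs/query-index.json?*"),
    ("Allow", "/blogs/tags.json?*"),
    # the /blogs/fragments/* Disallow rules are removed, so fragments match nothing
]

assert is_allowed(consolidated, "/blogs/fragments/footer")       # crawlable again
assert is_allowed(consolidated, "/blogs/media_0fa8b.jpeg?width=1200&format=webply&optimize=medium")
assert is_allowed(consolidated, "/blogs/tags.json?sheet=blog")
assert not is_allowed(consolidated, "/drafts/wip-article")       # drafts stay blocked
assert not is_allowed(consolidated, "/some/page?utm_source=x")   # /*? still applies elsewhere
```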