hlxsites / servicenow


Configure robots.txt for engine crawling #250

Closed andreituicu closed 4 months ago

andreituicu commented 4 months ago

Problems

What Google currently sees: https://search.google.com/test/rich-results/result?id=5-WxbiwaRUA7GqzuPx808w

Problem 1: ServiceNow has the following entry in their robots.txt (https://www.servicenow.com/robots.txt):

Disallow: /*?

This prevents robots from crawling images, because EDS serves media using the following URL pattern: media_${hash}.${extension}?width=${size}&format=${format}&optimize=medium

To fix this, the customer should add the following entry to their robots.txt:

Allow: */media_*?*

This Disallow rule also affects API calls (like index lookups, tags, etc.).
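
For context, here is a minimal sketch of how RFC 9309-style matchers (Googlebot and most major crawlers) resolve these rules: the `*` and `$` wildcards are honored, the longest matching pattern wins, and Allow wins ties. This is an illustration written for this issue, not any crawler's actual code, and the hash and query values below are made up.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as "match anything".
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*")
                      + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """rules: (directive, pattern) pairs, e.g. ('Disallow', '/*?').
    Longest matching pattern wins; 'Allow' wins ties; no match => allowed."""
    best = None  # (pattern_length, directive)
    for directive, pattern in rules:
        if rule_to_regex(pattern).match(url_path):
            length = len(pattern)
            if best is None or length > best[0] or \
               (length == best[0] and directive == "Allow"):
                best = (length, directive)
    return best is None or best[1] == "Allow"

before = [("Disallow", "/*?")]
print(is_allowed(before, "/media_1a2b3c.png?width=750&format=webply"))  # False: image blocked
print(is_allowed(before, "/blogs/query-index.json?limit=500"))          # False: API call blocked

after = before + [("Allow", "*/media_*?*"),
                  ("Allow", "/blogs/query-index.json?*")]
print(is_allowed(after, "/media_1a2b3c.png?width=750&format=webply"))   # True: longer Allow wins
print(is_allowed(after, "/blogs/query-index.json?limit=500"))           # True
```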

Problem 2: We were too aggressive with the blog-specific Disallow rules. Fragments should be allowed to be crawled, even though they must not be indexed: a crawler can only discover a fragment's noindex signal (e.g. page metadata) if it is allowed to fetch the fragment in the first place.

Solution

In summary, the following consolidated changes need to be made to robots.txt:

REMOVE:

Disallow: /blogs/fragments/*

Disallow: /uk/blogs/fragments/*

Disallow: /fr/blogs/fragments/*

Disallow: /de/blogs/fragments/*

Disallow: /nl/blogs/fragments/*

ADD:

Allow: */media_*?*
Allow: /blogs/query-index.json?*
Allow: /blogs/tags.json?*

Note: the Disallow rules for the drafts folders must still remain in place; see the check below.
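
As a sanity check, the consolidated rule set can be run through the `is_allowed` helper from the sketch above. The `/drafts/*` pattern and all sample URLs here are hypothetical stand-ins, not the actual entries in ServiceNow's robots.txt:

```python
consolidated = [
    ("Disallow", "/*?"),
    ("Disallow", "/drafts/*"),  # stand-in for the existing drafts rules, kept in place
    ("Allow", "*/media_*?*"),
    ("Allow", "/blogs/query-index.json?*"),
    ("Allow", "/blogs/tags.json?*"),
    # the /blogs/fragments/* Disallow rules are removed, so fragments match nothing
]

assert is_allowed(consolidated, "/blogs/fragments/footer")       # crawlable again
assert is_allowed(consolidated, "/blogs/media_0fa8b.jpeg?width=1200&format=webply&optimize=medium")
assert is_allowed(consolidated, "/blogs/tags.json?sheet=blog")
assert not is_allowed(consolidated, "/drafts/wip-article")       # drafts stay blocked
assert not is_allowed(consolidated, "/some/page?utm_source=x")   # /*? still applies elsewhere
```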