plone / volto

React-based frontend for the Plone Content Management System
https://demo.plone.org/
MIT License
447 stars 607 forks

Plone's default robots.txt prevents Google indexing if ?expand is used #4898

Open djay opened 1 year ago

djay commented 1 year ago

Describe the bug

Default robots.txt includes the rule

Disallow: /*?

If you use expansion to improve the performance of your Volto theme, the content API URLs then look similar to

https://digitalnsw.pretagov.com.au/++api++/?expand=actions,breadcrumbs,navigation&expand.navigation.depth=2

Googlebot can't crawl this URL, which results in a "soft 404" (as seen in Google Search Console), and Google won't include any of your site's pages in its index.

The soft 404 is caused by another bug: when the content API call hits a problem that Volto doesn't understand, it falls back to rendering the 404 Not Found page, but with a 200 status code. (This causes other issues too, since what should be a 500 error doesn't appear as such in GA or server logs.)

In addition, another default rule prevents /preview images from being loaded by Googlebot, which could cause indexing issues:

Disallow: /*view$
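To make the effect of these two rules concrete, here is a minimal Python sketch of robots.txt wildcard matching, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. It is not Googlebot's actual implementation, and Python's built-in `urllib.robotparser` is avoided here because it does not interpret these wildcards. The helper names and the `/news` paths are hypothetical and chosen only for illustration.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end.
    Everything else is treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_disallowed(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """Return True if any Disallow pattern matches from the start of the path."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query)
        for p in disallow_patterns
    )


# The two default rules quoted in this issue.
DEFAULT_RULES = ["/*?", "/*view$"]

# Path and query of the API call from the issue description.
api_call = "/++api++/?expand=actions,breadcrumbs,navigation&expand.navigation.depth=2"
# Hypothetical image URL ending in "preview" (assumed for illustration).
preview_image = "/news/@@images/image/preview"
# Hypothetical plain content URL with no query string.
plain_page = "/news"

for url in (api_call, preview_image, plain_page):
    print(url, "->", "blocked" if is_disallowed(url, DEFAULT_RULES) else "allowed")
```

Under these assumptions, the expanded API call is blocked because it contains a `?`, and any URL ending in `view` (including ones ending in `preview`) is blocked by the second rule, while the plain page URL is not.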

To Reproduce

  1. Add expansion to your site; see https://training.plone.org/effective-volto/backend/writing-content-expansion.html
  2. Enable Google Search Console for your site.
  3. Run "Inspect URL" and then "Test live URL" on any of your main content URLs.

TODO: there is perhaps a more direct way to test this, by using something that simulates blocking /*? URLs in the browser.

Expected behavior

Google indexes the page fine.

Screenshots

[Two screenshots attached to the original issue]

Proposed solution

Other solutions considered

It is not really clear what the best way forward is.

Software (please complete the following information):

Additional context


arsenico13 commented 11 months ago

We just ran into this issue and, as a temporary patch, added this line to robots.txt:

Allow: /*?expand*

We are testing it right now to see if this solves this issue.
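As a rough check of why this Allow line could override the default Disallow rule, the sketch below applies the precedence described in RFC 9309 and in Google's robots.txt documentation: the longest matching pattern wins, and a tie goes to Allow. It is an illustrative model only, not any crawler's real code. The function names and the `/search` URL are hypothetical, and whether the patch actually clears the soft 404 in Search Console is not confirmed in this thread.

```python
import re


def to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt pattern ('*' wildcard, trailing '$' anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(path_and_query: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence; ties go to Allow; no match means allowed.

    `rules` is a list of (directive, pattern) pairs such as ("Disallow", "/*?").
    """
    best = None  # (pattern length, is_allow) of the most specific match so far
    for directive, pattern in rules:
        if to_regex(pattern).match(path_and_query):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


default_rules = [("Disallow", "/*?")]
patched_rules = default_rules + [("Allow", "/*?expand*")]

# Path and query of the API call from the issue description.
api_call = "/++api++/?expand=actions,breadcrumbs,navigation&expand.navigation.depth=2"
# Hypothetical URL with a different query string (assumed for illustration).
other_query = "/search?SearchableText=plone"

print("default, api call:   ", is_allowed(api_call, default_rules))
print("patched, api call:   ", is_allowed(api_call, patched_rules))
print("patched, other query:", is_allowed(other_query, patched_rules))
```

In this model the patched rule set lets the `?expand=` call through because `/*?expand*` is longer than `/*?`, while other query strings stay blocked. Note that the pattern only matches when `expand` is the first query parameter.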