Taking the liberty here of prioritizing this as High.
Maybe a manual reindex is what's required here? Or a submission of a sitemap?
https://developers.google.com/search/docs/crawling-indexing/ask-google-to-recrawl
I do the sitemap reindex via Google Search Console all the time.
We had a URL prefix property, so only https://kedro.org and not everything under the kedro.org domain.
Requested a DNS change to LF AI & Data https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-26615
"Indexed, though blocked by robots.txt"
(┛ಠ_ಠ)┛彡┻━┻
https://support.google.com/webmasters/answer/7440203#indexed_though_blocked_by_robots_txt
Indexed, though blocked by robots.txt
The page was indexed despite being blocked by your website's robots.txt file. Google always respects robots.txt, but this doesn't necessarily prevent indexing if someone else links to your page. Google won't request and crawl the page, but we can still index it, using the information from the page that links to your blocked page. Because of the robots.txt rule, any snippet shown in Google Search results for the page will probably be very limited.
Next steps:
- If you do want to block this page from Google Search, robots.txt is not the correct mechanism to avoid being indexed. To avoid being indexed, remove the robots.txt block and use 'noindex'.
Important: For the noindex rule to be effective, the page or resource must not be blocked by a robots.txt file, and it has to be otherwise accessible to the crawler. If the page is blocked by a robots.txt file or the crawler can't access the page, the crawler will never see the noindex rule, and the page can still appear in search results, for example if other pages link to it.
https://developers.google.com/search/docs/crawling-indexing/block-indexing
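For context (not from the thread): the standard noindex signals are a robots <meta> tag in the page HTML or an X-Robots-Tag response header. A rough stdlib-only sketch to see which signals a given docs page actually sends, with the URL chosen only as an example:

```python
# Rough check of the noindex signals a page sends; robots.txt alone
# does not prevent indexing, so these directives are what actually matter.
from urllib.request import Request, urlopen

url = "https://docs.kedro.org/en/stable/"  # example page, swap in any URL

req = Request(url, headers={"User-Agent": "noindex-check"})
with urlopen(req) as resp:
    x_robots = resp.headers.get("X-Robots-Tag")  # header-based directive
    html = resp.read().decode("utf-8", errors="replace")

# Crude string check for a <meta name="robots" ... noindex ...> tag.
meta_noindex = 'name="robots"' in html and "noindex" in html
print("X-Robots-Tag header:", x_robots)
print("meta robots noindex present:", meta_noindex)
```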
Previous discussion about this on RTD: https://github.com/readthedocs/readthedocs.org/issues/10648
We got some good advice in https://github.com/readthedocs/readthedocs.org/issues/10648#issuecomment-2021128135
But this is blocked on #3586
Potentially related:
@astrojuanlu It would be very helpful to have access to the Google Search Console; can we catch up sometime this week? In addition, despite https://github.com/kedro-org/kedro/pull/3729, it appears the robots.txt isn't updated.
I am not super clear about the RTD build: do we need to manually refresh the robots.txt somewhere, or does it only get updated on release?
See: https://docs.kedro.org/robots.txt
To customize this file, you can create a robots.txt file that is written to your documentation root on your default branch/version.
https://docs.readthedocs.io/en/stable/guides/technical-docs-seo-guide.html#use-a-robots-txt-file
The default version (currently stable) has to see a new release for this to happen.
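For reference, a minimal sketch of the Sphinx side, assuming the file lives next to conf.py (as docs/source/robots.txt does): html_extra_path copies the listed files verbatim into the root of the built HTML, which is what Read the Docs serves for the default version.

```python
# docs/source/conf.py (sketch): ship robots.txt at the root of the HTML output.
html_extra_path = ["robots.txt"]
```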
We need to make sure the sitemap is crawled. See the Vizro example:
User-agent: *
Disallow: /en/0.1.9/ # Hidden version
Disallow: /en/0.1.8/ # Hidden version
Disallow: /en/0.1.7/ # Hidden version
Disallow: /en/0.1.6/ # Hidden version
Disallow: /en/0.1.5/ # Hidden version
Disallow: /en/0.1.4/ # Hidden version
Disallow: /en/0.1.3/ # Hidden version
Disallow: /en/0.1.2/ # Hidden version
Disallow: /en/0.1.11/ # Hidden version
Disallow: /en/0.1.10/ # Hidden version
Disallow: /en/0.1.1/ # Hidden version
Sitemap: https://vizro.readthedocs.io/sitemap.xml
Ours is blocked currently.
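One quick way to confirm that from outside Search Console (a sketch, not taken from the thread) is the standard library's robots.txt parser:

```python
# Check what the published robots.txt allows, using only the stdlib.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://docs.kedro.org/robots.txt")
rp.read()

for url in (
    "https://docs.kedro.org/sitemap.xml",
    "https://docs.kedro.org/en/stable/",
    "https://docs.kedro.org/en/latest/",
):
    print(url, "->", "allowed" if rp.can_fetch("*", url) else "blocked")

# Python 3.8+: list any Sitemap: entries declared in the file (None if absent).
print("Sitemap entries:", rp.site_maps())
```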
This isn't the primary goal of this ticket, but we can also look into it. The main goal of the ticket is "Why do URLs that we don't want to be indexed get indexed?", though we would definitely love to improve the opposite: "Why aren't URLs that we want to be indexed getting indexed?"
This makes it very clear that our robots.txt is just wrong.
Mind you, we don't want to index /en/latest/. The rationale is that we don't want users to land on docs that correspond to an unreleased version of the code.
Updated robots.txt in https://github.com/kedro-org/kedro/pull/3803
Will continue on this after the release - next sprint
Our sitemap still cannot be indexed
Renaming this issue, because there's nothing else to investigate: search engines (well, Google) will index pages blocked by robots.txt, because robots.txt is not the right mechanism to deindex pages.
Addressed in #3885, keeping this open until we're certain the sitemap has been indexed.
(robots.txt won't update until a new stable version is out)
robots.txt got updated 👍
Description
Even with robots.txt, search engines still index pages that are listed as disallowed.
Task
"We need to upskill ourselves on how Google indexes the pages; RTD staff suggested we add a conditional <meta> tag for older versions, but there's a chance this requires rebuilding versions that are really old, which might be completely impossible. At least I'd like engineering to get familiar with the docs building process, formulate what can reasonably be done, and state whether we need to make any changes going forward." @astrojuanlu
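A minimal sketch of what that conditional <meta> tag could look like on the Sphinx side, assuming a Read the Docs build (which sets the READTHEDOCS_VERSION environment variable). The flag name and the "only index stable" policy are assumptions, and a small layout.html override would still be needed to emit <meta name="robots" content="noindex"> whenever the flag is true:

```python
# docs/source/conf.py (sketch; the flag name below is hypothetical)
import os

# Set by Read the Docs during builds; empty for local builds.
rtd_version = os.environ.get("READTHEDOCS_VERSION", "")

# Exposed to the HTML templates, where an override would add
# <meta name="robots" content="noindex"> for every non-stable version.
html_context = {
    "noindex": rtd_version != "stable",
}
```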
Context and example
https://www.google.com/search?q=kedro+parquet+dataset&sca_esv=febbb2d9e55257df&sxsrf=ACQVn0-RnsYyvwV7QoZA7qtz0NLUXLTsjw%3A1710343831093&ei=l8bxZfueBdSU2roPgdabgAk&ved=0ahUKEwi7xvujx_GEAxVUilYBHQHrBpAQ4dUDCBA&uact=5&oq=kedro+parquet+dataset&gs_lp=Egxnd3Mtd2l6LXNlcnAiFWtlZHJvIHBhcnF1ZXQgZGF0YXNldDILEAAYgAQYywEYsAMyCRAAGAgYHhiwAzIJEAAYCBgeGLADMgkQABgIGB4YsANI-BBQ6A9Y6A9wA3gAkAEAmAEAoAEAqgEAuAEDyAEA-AEBmAIDoAIDmAMAiAYBkAYEkgcBM6AHAA&sclient=gws-wiz-serp (thanks @noklam)
Result: https://docs.kedro.org/en/0.18.5/kedro.datasets.pandas.ParquetDataSet.html
However, that version is no longer allowed in our robots.txt: https://github.com/kedro-org/kedro/blob/1f2adf12255fc312ab9d429cbf6f851a13947cf3/docs/source/robots.txt#L1-L9
And in fact, according to https://technicalseo.com/tools/robots-txt/,