Open only1chunts opened 4 years ago
It might also be that we need to check whether any of the bot-blocker work that was done to prevent bots crawling the FTP server has accidentally affected gigadb.org too.
It appears that the dataset pages contain these lines in the HTML:
<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">
Could that be causing the lack of indexing in Google? To see this I used the rich-results checker tool, e.g.: https://search.google.com/test/rich-results?utm_campaign=sdtt&utm_medium=url&id=rkYdN_AWFjpXO_JUaPx9Gg
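A minimal sketch of how the same check could be scripted rather than done by eye, assuming the page HTML has already been fetched into a string. The class and function names (`RobotsMetaParser`, `blocks_indexing`) are hypothetical and not part of the GigaDB codebase.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None for valueless attributes.
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        if name in ("robots", "googlebot"):
            self.directives[name] = attr_map.get("content", "").lower()


def blocks_indexing(html):
    """Return True if any robots/googlebot meta tag carries a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any(
        "noindex" in [part.strip() for part in content.split(",")]
        for content in parser.directives.values()
    )
```

Running `blocks_indexing` over the HTML of a dataset page would report whether either of the two tags quoted above is still present.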
The rich-results checker link in the previous comment shows the metadata is being made available but with a number of issues:
Page loading issue: Not all page resources could be loaded. This can affect how Google sees and understands your page. Fix availability problems for any resources that can affect how Google understands your page.
- ERROR - Invalid object type for field "license"
- Warning - Missing field "creator" (optional)
- Warning - Invalid value type for field "license" (optional)
- Warning - Missing field "encodingFormat" (optional)
@ChrisArmit makes a good point when testing the release:
Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding manuscript in GigaScience should appear on the first page
Result: this works.
Given GigaDB is optimised for Google
When I search "A molecular map of lung neuroendocrine neoplasms"
Then the corresponding dataset in GigaDB should appear on the first page
Result: this does not work.
In fact, when you add the word "gigadb" to the search term in Google, it does find the correct GigaDB page, but shows it like this:
If you click the "Learn why" button it takes you here: https://support.google.com/webmasters/answer/7489871?hl=en
@only1chunts The robots.txt is disallowing all crawlers, including Googlebot, from reading any page of the site.
Suggested change:
- replace the content of robots.txt with this:
User-agent: *
Allow: /
- prevent bots from indexing admin pages by adding rules to robots.txt that disallow them
- remove
<meta name="robots" content="noindex, nofollow"><meta name="googlebot" content="noindex, nofollow">
from public pages
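The suggested robots.txt can be sanity-checked before deployment. The sketch below uses Python's standard `urllib.robotparser`. The `/admin/` prefix, the example URLs, and the "block everything" file are assumptions for illustration, since the thread does not quote the current robots.txt or the admin paths.

```python
from urllib.robotparser import RobotFileParser

# Assumed form of a robots.txt that blocks every crawler.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Proposed rules plus a hypothetical admin exclusion. The Disallow line comes
# first because Python's parser applies the first matching rule.
PROPOSED = """\
User-agent: *
Disallow: /admin/
Allow: /
"""


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, `can_crawl(PROPOSED, "https://gigadb.org/dataset/100001")` should be true, while the same call with an `/admin/` URL should be false.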
Interesting. Up until the switchover, we disallowed search engine indexing on non-live environments as we don't want non-live data to be public. After the switchover, we didn't change that setting, but we saw in the logs that search engine bots were indexing the new website anyway, so we thought they didn't respect the no-indexing directives. In any case, the directive needs to be reversed, but on live only. We still don't want our dev and staging environments to appear on search engines. I think it's better that I create a specific ticket for switching the indexing directive based on environment (as it's not as trivial as it appears).
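One way the environment-based switch could look, as a rough sketch only. The function names are hypothetical, the environment labels ("live", "dev", "staging") are taken from the comment above, and how the real application learns its environment is not covered here.

```python
def robots_meta_tags(environment):
    """Return the robots meta tags to emit for a deployment environment.

    Only "live" is indexable; every other environment gets noindex, nofollow.
    """
    if environment == "live":
        return ""
    content = "noindex, nofollow"
    return (
        f'<meta name="robots" content="{content}">'
        f'<meta name="googlebot" content="{content}">'
    )


def robots_txt(environment):
    """Return a robots.txt body: open on live, closed everywhere else."""
    rule = "Allow: /" if environment == "live" else "Disallow: /"
    return f"User-agent: *\n{rule}\n"
```

Defaulting every environment other than live to the blocking directives means a newly added or misnamed environment stays out of search results.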
Another issue that affects SEO is that search engines still know gigadb.org as an http site and don't see https://gigadb.org as the same site, because we don't yet have explicit redirection from http to https (there's issue #1799 for that, but it is not yet implemented).
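To illustrate what the explicit redirection amounts to, here is a small sketch of the URL mapping only. The function name is hypothetical, and the actual redirect in #1799 would presumably live in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit


def https_redirect_target(url):
    """Return the https equivalent of an http URL, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None
    # Keep host, path, query and fragment; change only the scheme.
    return parts._replace(scheme="https").geturl()
```

A permanent (301) redirect to the returned URL is the usual way to signal to search engines that the http and https addresses are the same site.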
Finally, we may have to manually re-index our website on the various search engines using their dashboards and search tools to have better control over how we show up (and maybe to ensure that the old http site is no longer indexed).
Is your feature request related to a problem? Please describe.
Historically GigaDB datasets have appeared in Google search results, but recently they have stopped appearing. This can be seen if you search Google for any word with the restriction "site:www.gigadb.org", e.g. genome site:www.gigadb.org
Describe the solution you'd like
It is probable that the solution will require the use of schema.org. According to #73 this has been implemented on the home page, but it needs to be extended to the dataset pages (and all other gigadb.org pages).