Improve SEO to enable GigaDB datasets to be found by Google/Bing/Baidu searches

only1chunts commented 4 years ago

User Story

As a Website User I want to find GigaDB content through search engines So that I can conveniently find GigaDB content relevant to my needs

Acceptance Criteria

Given GigaDB is optimised for Google When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Google When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Bing When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Bing When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Baidu When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Baidu When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding dataset in GigaDB should appear on the first page

Given GigaDB is optimised for Yandex When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding manuscript in GigaScience should appear on the first page

Given GigaDB is optimised for Yandex When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding dataset in GigaDB should appear on the first page

Additional Info

Product Backlog Item Ready Checklist

[ ] Business value is clearly articulated
[ ] Item is understood enough by the IT team so it can make an informed decision as to whether it can complete this item
[ ] Dependencies are identified and no external dependencies would block this item from being completed
[ ] At the time of the scheduled sprint, the IT team has the appropriate composition to complete this item
[ ] This item is estimated and small enough to comfortably be completed in one sprint
[ ] Acceptance criteria are clear and testable
[ ] Performance criteria, if any, are defined and testable
[ ] The Scrum team understands how to demonstrate this item at the sprint review

Product Backlog Item Done Checklist

[ ] Code is complete
[ ] Automated tests related to the changes are implemented and passing
[ ] All automated test suites are passing locally
[ ] Code is refactored to best practices and coding standards
[ ] Documentation is updated as needed
[ ] A Pull Request has been created and review requested
[ ] Pull Request is reviewed and approved
[ ] The item has been merged to the develop branch
[ ] All automated test suites are passing on the continuous integration pipeline and item is ready to release

Is your feature request related to a problem? Please describe. Historically, GigaDB datasets have appeared in Google search results, but recently they have stopped appearing. This can be seen by searching Google for ANY word with the restriction "site:www.gigadb.org", e.g. genome site:www.gigadb.org

Describe the solution you'd like It is probable that the solution will require the use of schema.org. According to #73 this has been implemented on the home page, but it needs to be extended to the dataset pages (and all other gigadb.org pages).

only1chunts commented 4 years ago

It might also be worth checking whether any of the bot-blocker work that was done to prevent bots crawling the FTP server has accidentally affected gigadb.org too.

only1chunts commented 3 years ago

It appears that the dataset pages contain these lines in the HTML:

<meta name="robots" content="noindex">
<meta name="googlebot" content="noindex">

Could that be causing the lack of indexing in Google? To see this I used the rich-results checker tool, e.g.: https://search.google.com/test/rich-results?utm_campaign=sdtt&utm_medium=url&id=rkYdN_AWFjpXO_JUaPx9Gg
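A quick way to confirm whether a page carries these tags, without going through the rich-results tool, is to parse its HTML and look for a noindex directive. The following is a minimal sketch using only the Python standard library; the names `RobotsMetaParser` and `blocks_indexing` are made up for illustration, and the sample HTML simply wraps the two meta lines quoted above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            self.directives[name] = (attributes.get("content") or "").lower()


def blocks_indexing(html):
    """Return True if any robots/googlebot meta tag contains a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return any(
        "noindex" in [token.strip() for token in content.split(",")]
        for content in parser.directives.values()
    )


# Sample page head built from the two meta lines quoted in this comment.
sample = (
    '<html><head>'
    '<meta name="robots" content="noindex">'
    '<meta name="googlebot" content="noindex">'
    '</head></html>'
)
print(blocks_indexing(sample))
```

The same function could be pointed at the saved HTML of any dataset page to check it after a configuration change.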

only1chunts commented 3 years ago

The rich-results checker link in the previous comment shows the metadata is being made available but with a number of issues:

Page loading issue: Not all page resources could be loaded. This can affect how Google sees and understands your page. Fix availability problems for any resources that can affect how Google understands your page.
ERROR - Invalid object type for field "license"
Warning - Missing field "creator" (optional)
Warning - Invalid value type for field "license" (optional)
Warning - Missing field "encodingFormat" (optional)
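To make the reported issues concrete, here is a rough sketch of a schema.org Dataset description that would avoid them, together with a small check that approximates the three complaints. It assumes the checker wants `license` to be a URL or a CreativeWork object, a `creator` to be present, and each `distribution` entry to carry an `encodingFormat`. All values in `dataset` (names, example.org URLs, the CC0 licence URL) are placeholders, not real GigaDB metadata, and `check_dataset_jsonld` is a hypothetical helper rather than a reimplementation of Google's validator.

```python
import json


def check_dataset_jsonld(data):
    """Return the names of fields that would likely trigger the reported issues."""
    problems = []

    # Assumption: license must be a URL string or a CreativeWork object.
    license_value = data.get("license")
    is_url = isinstance(license_value, str) and license_value.startswith(
        ("http://", "https://")
    )
    is_creative_work = (
        isinstance(license_value, dict)
        and license_value.get("@type") == "CreativeWork"
    )
    if not (is_url or is_creative_work):
        problems.append("license")

    if not data.get("creator"):
        problems.append("creator")

    # Every distribution entry should declare its file format.
    distributions = data.get("distribution") or []
    if not distributions or not all(d.get("encodingFormat") for d in distributions):
        problems.append("encodingFormat")

    return problems


# Placeholder metadata for illustration only.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset title",
    "description": "Placeholder description of the dataset.",
    "url": "https://example.org/dataset/EXAMPLE",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "creator": [{"@type": "Person", "name": "Example Author"}],
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/files/example.csv",
        }
    ],
}

print(check_dataset_jsonld(dataset))
print(json.dumps(dataset, indent=2))
```

The JSON printed at the end is what would go inside a `<script type="application/ld+json">` element on the dataset page.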

only1chunts commented 5 months ago

@ChrisArmit makes a good point when testing the release:

Given GigaDB is optimised for Google When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding manuscript in GigaScience should appear on the first page

Result: this works

Given GigaDB is optimised for Google When I search "A molecular map of lung neuroendocrine neoplasms" Then the corresponding dataset in GigaDB should appear on the first page

Result: this does not work

In fact, when you add the word gigadb to the search term in Google it does find the correct GigaDB dataset but shows it like this:

If you click the "Learn why" button it takes you here: https://support.google.com/webmasters/answer/7489871?hl=en

luistoptal commented 5 months ago

@only1chunts The robots.txt is disallowing all crawlers, including Googlebot, from reading any page of the site.

Suggested change:

Replace the content of robots.txt with this:

User-agent: *
Allow: /

Prevent bots from indexing admin pages by adding further Disallow rules for them to robots.txt.
Remove <meta name="robots" content="noindex, nofollow"><meta name="googlebot" content="noindex, nofollow"> from public pages.
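The effect of the suggested robots.txt can be checked offline with Python's standard `urllib.robotparser`. This is only a sketch: the current file's contents are not quoted in this thread, so `BLOCK_ALL` is an assumed stand-in for a file that disallows everything, and the `/admin/` path and the example URLs are hypothetical. The Disallow rule is listed before `Allow: /` because Python's parser applies the first matching rule.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Assumed shape of a robots.txt that blocks every crawler.
BLOCK_ALL = """User-agent: *
Disallow: /
"""

# Suggested replacement, plus a hypothetical rule for admin pages.
PROPOSED = """User-agent: *
Disallow: /admin/
Allow: /
"""

dataset_url = "https://gigadb.org/dataset/EXAMPLE"  # hypothetical path
admin_url = "https://gigadb.org/admin/EXAMPLE"      # hypothetical path

print("blocked file, dataset page:", can_crawl(BLOCK_ALL, dataset_url))
print("proposed file, dataset page:", can_crawl(PROPOSED, dataset_url))
print("proposed file, admin page:", can_crawl(PROPOSED, admin_url))
```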

rija commented 5 months ago

Interesting. Up until the switchover, we disallowed search engine indexing on non-live environments because we don't want non-live data to be public. After the switchover we didn't change that setting, but we saw in the logs that search engine bots were indexing the new website anyway, so we thought they didn't respect the no-indexing directives. In any case, the directive needs to be reversed, but on live only: we still don't want our dev and staging environments to appear on search engines. I think it's better that I create a specific ticket for switching the indexing directive based on environment (as it's not as trivial as it appears).
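As a rough illustration of what switching the directive by environment could look like, the sketch below picks the robots.txt body and the robots meta tags from an environment name. It is not how the GigaDB codebase does it: the function names and the environment labels ("live", "staging", "dev") are assumptions, and anything that is not explicitly "live" falls back to blocking so that a misconfigured environment is never exposed by accident.

```python
LIVE_ROBOTS_TXT = "User-agent: *\nAllow: /\n"
NON_LIVE_ROBOTS_TXT = "User-agent: *\nDisallow: /\n"

NOINDEX_META = (
    '<meta name="robots" content="noindex, nofollow">'
    '<meta name="googlebot" content="noindex, nofollow">'
)


def robots_txt(environment):
    """Return the robots.txt body to serve for the given environment."""
    return LIVE_ROBOTS_TXT if environment == "live" else NON_LIVE_ROBOTS_TXT


def robots_meta_tags(environment):
    """Return the robots meta tags to render; empty on live so pages can be indexed."""
    return "" if environment == "live" else NOINDEX_META


for env in ("live", "staging", "dev"):
    print(env, repr(robots_txt(env)), repr(robots_meta_tags(env)))
```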

Another issue that affects SEO is that search engines still know gigadb.org as an http site and don't see https://gigadb.org as the same site, because we don't yet have an explicit redirection from http to https (there's issue #1799 for that, but it is not yet implemented).
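The redirect itself would normally live in the web server or load balancer configuration, and how #1799 will do it is not decided here. Purely to show the intended behaviour, this sketch maps a plain-http URL to a permanent (301) redirect to its https equivalent; `https_redirect` and the example path are hypothetical.

```python
from urllib.parse import urlsplit


def https_redirect(url):
    """Return (301, https_url) for a plain-http URL, or None if no redirect is needed."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        return None
    # Keep host, path, query and fragment; only the scheme changes.
    return 301, parts._replace(scheme="https").geturl()


print(https_redirect("http://gigadb.org/dataset/EXAMPLE?tab=files"))
print(https_redirect("https://gigadb.org/dataset/EXAMPLE"))
```

A permanent redirect is the usual signal that tells search engines to treat the https address as the replacement for the http one.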

Finally, we may have to manually re-index our website on the various search engines, using their dashboards and search tools, to have better control over how we show up (and maybe to ensure that the old http site is no longer indexed).

gigascience / gigadb-website