opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Fixing SEO: Site-maps FE (Should) #3208

Open carcruz opened 8 months ago

carcruz commented 8 months ago

sitemap.xml is generated by a command-line tool (BE)

Process improvement for every release (Prashant)
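A minimal sketch of what such a command-line generator could look like, assuming a plain-text file of entity IDs (one per line); the file names and URL pattern are illustrative, not the actual tooling:

```python
#!/usr/bin/env python3
"""Sketch: build a sitemap.xml from a list of entity IDs."""
from xml.sax.saxutils import escape

# Assumed URL pattern for target pages; other entities would get their own prefix.
BASE = "https://platform.opentargets.org/target/"

def write_sitemap(ids_path: str, out_path: str) -> None:
    with open(ids_path) as src, open(out_path, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for line in src:
            entity_id = line.strip()
            if entity_id:
                out.write(f"  <url><loc>{escape(BASE + entity_id)}</loc></url>\n")
        out.write("</urlset>\n")

if __name__ == "__main__":
    write_sitemap("target_ids.txt", "sitemap.xml")  # assumed file names
```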

jdhayhurst commented 8 months ago

Hey guys, I haven't done any concrete research, but here's what I can gather so far.

Why are pages not being indexed by google?

The number one reason by far is that the pages have not yet been crawled.

Why are our pages not being crawled by google?

In order for google to index our pages, it first crawls (requests) them, and then, based on the content it found, it may index the page. Crawling is done by the googlebot and costs resources, so each site gets a crawl budget. Submitting a large number of pages, heavy/slow pages, pages deemed to be of lower quality, duplicate pages, or pages that we don’t even want indexed spends that budget unnecessarily and pushes the pages that we do want indexed further down the queue. This may be why many of our pages have not been crawled.

What can we do?

Easy

  1. Exclude any pages that we don’t want indexed by adding them to a robots.txt Disallow directive, e.g. the evidence pages (see the sketch after this list).
  2. Stop submitting URLs in the sitemaps for pages that are not substantially different from each other (similar pages will be treated as duplicates), e.g. stop submitting association pages.
  3. Stop submitting URLs in the sitemaps for pages that don’t contain information worth indexing, e.g. stop submitting association pages.
  4. Limit the submitted URLs to just the pages that will carry value once indexed; think more about the content of the pages we actually want indexed.
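As a sketch of item 1, and assuming the evidence pages live under an /evidence/ path (the real routes would need checking), the robots.txt entry would look like:

```
User-agent: *
Disallow: /evidence/
```

Note that Disallow stops crawling but does not by itself guarantee de-indexing of pages google already knows about; its main effect is that crawl budget stops being spent there.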

Harder

  1. Improve the quality/performance of pages. E.g. one thing google penalizes is cumulative layout shift (CLS, https://web.dev/articles/cls), which is apparently happening on evidence pages even though they are not in our sitemaps.
  2. Add a condition so that if a request comes from a bot, we return a simpler response that is more useful to a bot, e.g. on the association pages we could return a text summary (see the sketch after this list).
  3. Add JSON-LD to the page sources (https://bioschemas.org/) to enrich how we appear on google. This could be overkill given that our own search is strong... but it exists for a reason.
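As a rough illustration of item 2, here is a minimal, hypothetical sketch of serving a plain-text summary to crawlers while everyone else gets the normal SPA. It uses Flask purely for brevity; the route, bot list, and `build_text_summary` helper are assumptions, not the actual Open Targets backend:

```python
from flask import Flask, request, send_file

app = Flask(__name__)

# Assumed list of crawler user-agent fragments; a production list would be longer.
BOT_SIGNATURES = ("googlebot", "bingbot", "duckduckbot")

def looks_like_bot(user_agent: str) -> bool:
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

def build_text_summary(ensembl_id: str) -> str:
    # Hypothetical helper: in practice this could query the GraphQL API
    # for the top disease associations of the target.
    return f"Top disease associations for target {ensembl_id} on the Open Targets Platform."

@app.route("/target/<ensembl_id>/associations")
def associations(ensembl_id):
    if looks_like_bot(request.headers.get("User-Agent", "")):
        # Bots get a small, crawlable text response instead of the JS bundle.
        return build_text_summary(ensembl_id), 200, {"Content-Type": "text/plain"}
    # Everyone else gets the single-page app shell as usual.
    return send_file("index.html")
```

One caveat: serving bots different content from users is close to cloaking, so the summary should describe the same information the page actually renders.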
mbdebian commented 5 months ago

@carcruz @prashantuniyal02 and @jdhayhurst, are we exploring the JSON-LD approach for embedded metadata in the Open Targets Platform, as in other life sciences resources, e.g. identifiers.org?
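For illustration, a minimal, hypothetical sketch of what such embedded metadata could look like in the HTML source of a target page. The schema.org Gene type comes from the Bioschemas work; the identifier and description here are only examples, and the exact profile and fields would need to be agreed:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Gene",
  "identifier": "ENSG00000157764",
  "name": "BRAF",
  "url": "https://platform.opentargets.org/target/ENSG00000157764",
  "description": "BRAF target profile, with disease association evidence, on the Open Targets Platform."
}
</script>
```

Crawlers can read this block without executing the app's JavaScript, which helps a client-side rendered SPA whose content otherwise only appears after the bundle runs.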

prashantuniyal02 commented 3 weeks ago

@mbdebian and @carcruz to review the relevant options in a couple of months.