opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Sitemap step implementation #1485

Closed d0choa closed 3 years ago

d0choa commented 3 years ago

This is code complete and requires integration with POS.


As the development team, we want a reproducible mechanism to generate sitemaps for releases after 21.04.

The following pages should be included in the sitemaps:

Metadata fields 'priority' and 'frequency' will not be included in the sitemaps.

Static pages will no longer have sitemaps generated.

Technical note: Google has a limit of 50k entries per sitemap. This can be worked around by breaking large sitemap files into smaller ones and listing each of them in the sitemap index file.
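The splitting described in the technical note could be sketched as follows. This is a minimal illustration in Python, not the code that was eventually written for this ticket. The function names, the `{category}_{n}.xml` file naming and the `base_url` argument are assumptions made for the example. Each entry carries only a `<loc>`, in line with dropping 'priority' and 'frequency'.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # the 50k per-sitemap entry limit from the technical note


def chunk(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_sitemap(urls):
    """Return a <urlset> document (as a string) with one <loc> per URL."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return a <sitemapindex> document listing every sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_all(category, urls, base_url, max_urls=MAX_URLS):
    """Split one page category into sitemap files and build their index.

    File names follow a hypothetical `{category}_{n}.xml` pattern.
    Returns ({filename: xml_string}, index_xml_string).
    """
    files = {}
    for i, part in enumerate(chunk(urls, max_urls), start=1):
        files[f"{category}_{i}.xml"] = build_sitemap(part)
    index = build_index([f"{base_url}/{name}" for name in files])
    return files, index
```

Calling `build_all("targets", target_urls, "https://example.org/sitemaps")` would return the chunked files together with an index that lists them, so adding or removing chunks never requires editing the index by hand.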

Subtasks

~~ - Add 'drug' to outputs available in https://www.targetvalidation.org/downloads/data ~~

andrewhercules commented 3 years ago

Currently, I generate the sitemaps using a Python notebook, xml-sitemap-generator.ipynb. It's not optimised for performance, but it works as intended.

It takes the lists of targets and diseases and generates all formatted, prettified sitemaps in ~3 seconds.

In terms of output, it generates 5 sitemaps:

Each of these sitemaps is then copied into the /sitemaps directory in the webapp.

As we have changed the structure of the platform-app and moved more of the content into the Platform documentation, the static_pages.xml file is no longer needed, so the sitemaps could be generated after the ETL pipelines have been run.

For 21.04, I will generate the sitemaps manually and provide them to the front-end team - see #1480.

d0choa commented 3 years ago

We should think about improvements now that we are going to create a new process. Some thoughts:

Any other thoughts?

andrewhercules commented 3 years ago

I agree with including drug profile pages. I had blocked Google from indexing the previous drug summary pages because of issues with URLs containing ? across the website (e.g. search pages, associations pages with facets, etc.).

We can include all target and disease profile pages provided they have some sort of information. We will get penalised if the content is deemed to be poor or repetitive. Ideally, each profile page should have enough information that is immediately available to distinguish it from other pages (e.g. name, identifier, description, synonyms, cross references).
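One way to apply this rule when building the URL lists is sketched below. It is only an illustration: the record field names (`id`, `name`, `description`, `synonyms`, `cross_references`), the `min_fields` threshold and the `{base_url}/{entity}/{id}` URL pattern are all assumptions, not the actual schema or routing of the Platform.

```python
# Hypothetical field names; the real target/disease records may differ.
DESCRIPTIVE_FIELDS = ("description", "synonyms", "cross_references")


def has_enough_content(record, min_fields=1):
    """Return True if the record has an id, a name and at least
    `min_fields` non-empty descriptive fields."""
    if not record.get("id") or not record.get("name"):
        return False
    filled = sum(1 for field in DESCRIPTIVE_FIELDS if record.get(field))
    return filled >= min_fields


def profile_urls(records, base_url, entity):
    """Build profile page URLs only for records worth indexing."""
    return [
        f"{base_url}/{entity}/{record['id']}"
        for record in records
        if has_enough_content(record)
    ]
```

Records that carry only an identifier and a name are skipped, so near-empty profile pages never reach the sitemap.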

Breaking any of the sitemap page categories into chunks will work - we will just need to update the index.xml file.

And yes, we can drop priority and frequency to make the files smaller. These were legacy attributes that I used when we were updating sitemaps less frequently.

JarrodBaker commented 3 years ago

This is effectively done, and the code can be found in the ot-sitemaps repository. We will use this for releases going forward.

I've moved it to the Platform Output Support epic so it can be integrated with the other tasks there, but it will be a low priority.

andrewhercules commented 3 years ago

That's great - thank you @JarrodBaker! 👍

mbdebian commented 3 years ago

ot-sitemaps is being integrated with Travis (https://github.com/opentargets/ot-sitemap-cli/pull/3) to automate assembly of the jar file that the provisioner will run. This is currently being tested on this terraform provisioner branch.

andrewhercules commented 3 years ago

Ticket closed, as the sitemaps are now generated by the ot-sitemap-cli scripts during the deployment process.