Closed d0choa closed 3 years ago
Currently, I generate the sitemaps using a Python notebook, xml-sitemap-generator.ipynb. It's not optimised for performance, but it works as intended.
It takes the lists of targets and diseases and generates all formatted, prettified sitemaps in ~3 seconds.
In terms of output, it generates 5 sitemaps:
Each of these sitemaps is then copied into the /sitemaps directory in the webapp.
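For anyone picking this up, here is a minimal sketch of the kind of generation step described above. It is not the notebook's actual code: the base URL, the `/{entity}/{id}` path layout and the `build_sitemap` helper are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumption: base URL and path layout are illustrative, not taken from the notebook.
BASE_URL = "https://platform.opentargets.org"


def build_sitemap(entity: str, ids: list[str]) -> bytes:
    """Return a prettified <urlset> document with one <url><loc> per identifier."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entity_id in ids:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = f"{BASE_URL}/{entity}/{entity_id}"
    ET.indent(urlset)  # pretty-print in place (Python 3.9+)
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Example identifiers only; a real run would read the full target list.
target_sitemap = build_sitemap("target", ["ENSG00000157764", "ENSG00000141510"])
print(target_sitemap.decode("utf-8"))
```

The same helper could be called once per page category, with each result written to its own file before copying.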
As we have changed the structure of the `platform-app` and moved more of the content into the Platform documentation, there is no need for the `static_pages.xml` file and so the sitemaps could be generated after the ETL pipelines have been run.
For 21.04 I will generate the sitemaps manually and provide them to the front-end team - see #1480.
We should think about improvements now that we are going to create a new process. Some thoughts:
`priority` or `frequency`. It might be a way to make files lighter and simplify the logic. Any other thoughts?
I agree with including drug profile pages, as I had blocked Google from indexing the previous drug summary pages due to issues with URLs containing `?` across the website (e.g. search pages, associations pages with facets, etc.).
We can include all target and disease profile pages provided they have some sort of information. We will get penalised if the content is deemed to be poor or repetitive. Ideally, each profile page should have enough information that is immediately available to distinguish it from other pages (e.g. name, identifier, description, synonyms, cross references).
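As a rough illustration of the "enough information" check, something like the following could be applied before a page is added to a sitemap. The field names and the rule itself (identifier and name required, plus at least one descriptive field) are assumptions for the sketch, not an agreed policy.

```python
# Hypothetical field names; the real records may use different keys.
REQUIRED_FIELDS = ("id", "name")
DESCRIPTIVE_FIELDS = ("description", "synonyms", "crossReferences")


def is_indexable(record: dict) -> bool:
    """True when a record has an id, a name and at least one descriptive field."""
    has_required = all(record.get(field) for field in REQUIRED_FIELDS)
    has_detail = any(record.get(field) for field in DESCRIPTIVE_FIELDS)
    return has_required and has_detail


# Made-up example records, only to show the filter in use.
records = [
    {"id": "ENSG00000157764", "name": "BRAF", "description": "Example description"},
    {"id": "ENSG00000000001", "name": "", "description": ""},
    {"id": "ENSG00000000002", "name": "Example", "synonyms": []},
]
indexable_ids = [record["id"] for record in records if is_indexable(record)]
```

Empty strings and empty lists count as missing here, so a page with only an identifier would be left out.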
Breaking any of the sitemap page categories into chunks will work - we will just need to update the index.xml file.
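A sketch of how the chunking and the index update might fit together, assuming chunks are capped at the sitemap protocol's limit of 50,000 URLs per file. The file naming scheme, the base URL and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL limit in the sitemap protocol


def chunk(items, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def chunked_file_names(category, ids, size=MAX_URLS_PER_FILE):
    """Name one sitemap file per chunk (hypothetical naming scheme)."""
    return [
        f"{category}_pages_{number}.xml"
        for number, _ in enumerate(chunk(ids, size), start=1)
    ]


def build_index(base_url, file_names):
    """Return a <sitemapindex> document listing every chunk file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in file_names:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url}/{name}"
    ET.indent(index)  # pretty-print in place (Python 3.9+)
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Placeholder identifiers and base URL, for illustration only.
ids = [f"ID{i}" for i in range(120_000)]
names = chunked_file_names("target", ids)
index_xml = build_index("https://example.org/sitemaps", names)
```

With 120,000 identifiers this yields three chunk files, and the index gains one `<sitemap>` entry per file, so only the index needs to change when a category is split.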
And yes, we can drop `priority` and `frequency` to make the files smaller. These were legacy attributes that I used when we were updating sitemaps less frequently.
This is effectively done, and the code can be found in the ot-sitemaps repository. We will use this for releases going forward.
I've moved it to the Platform Output Support epic so it can be integrated with the other tasks there but it will be a low priority.
That's great - thank you @JarrodBaker! 👍
ot-sitemaps is being integrated with Travis (https://github.com/opentargets/ot-sitemap-cli/pull/3) for automated assembly of the jar file that will be run by the provisioner. This is currently being tested on a Terraform provisioner branch.
Ticket closed, as sitemaps are now generated by the `ot-sitemap-cli` scripts during the deployment process.
As the development team, we want a reproducible mechanism to generate sitemaps for releases after 21.04.
The following pages should be included in the sitemaps:
Metadata fields 'priority' and 'frequency' will not be included in the sitemaps.
Static pages will no longer have sitemaps generated.
Subtasks
- ~~Add 'drug' to outputs available in https://www.targetvalidation.org/downloads/data~~