identifiers-org / identifiers-org.github.io

ensembl pattern needs to be updated #193

Closed luciansmith closed 1 year ago

luciansmith commented 2 years ago

As far as I can tell, the following URIs (all taken from various biomodels) should all be valid:

http://identifiers.org/ensembl/ENSG00000049246.14
http://identifiers.org/ensembl/ENSG00000109819.9
http://identifiers.org/ensembl/ENSG00000132326.12
http://identifiers.org/ensembl/ENSG00000179094.16

However, the 'ensembl' pattern recognizer doesn't accept the trailing '.14' or '.9'. Looking up those strings, though, I do find genes annotated in that manner, cf.:

https://gtexportal.org/home/gene/PER3
https://gnomad.broadinstitute.org/gene/ENSG00000109819?dataset=gnomad_r3_non_neuro
https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000132326;r=2:238244044-238290102

Thus, I believe the ensembl pattern should be updated.

luciansmith commented 2 years ago

Having done more research, I now think that since the '.y' part of the URL is just the version number, it's correct to leave it off. The annotation is to the biological entity itself, not our understanding of the biological entity at a particular moment in time.
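For anyone who wants to annotate the entity rather than a specific version, the suffix can be stripped before building the URI. The following is a minimal Python sketch, assuming (as this comment does) that everything after the dot is a numeric version; `split_ensembl_version` is a hypothetical helper name, not part of any Identifiers.org or Ensembl tooling.

```python
from typing import Optional, Tuple


def split_ensembl_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ENSG00000049246.14' into ('ENSG00000049246', 14).

    ASSUMPTION: the part after the first dot is a numeric version suffix.
    Identifiers without such a suffix are returned unchanged with None.
    """
    stable_id, sep, version = identifier.partition(".")
    if sep and version.isdecimal():
        return stable_id, int(version)
    return identifier, None


if __name__ == "__main__":
    for example in ["ENSG00000049246.14", "ENSG00000109819"]:
        print(example, split_ensembl_version(example))
```

Keeping the version as a separate value means the choice between a versioned and an unversioned link can be made later, case by case.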

luciansmith commented 2 years ago

Well, @jonrkarr thinks the regexp is incorrect after all. He writes:

Given this, I'm re-opening this issue.

jonrkarr commented 2 years ago

The pattern appears to be incorrectly escaped. I believe this is correct (\\ replaced with \):

^((ENS[FPTG]\d{11}(\.\d+)?)|(FB\w{2}\d{7})|(Y[A-Z]{2}\d{3}[a-zA-Z](\-[A-Z])?)|([A-Z_a-z0-9]+(\.)?(t)?(\d+)?([a-z])?))$
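As a quick check of the pattern above, here is a minimal Python sketch that compiles it with single backslashes and tries it against the identifiers from this thread. The pattern is copied from this comment; the function name `is_valid_ensembl` is only an illustration, and Python's `re` engine is an assumption (the registry may use a different regex flavor).

```python
import re

# Pattern copied from the comment above, split across lines for readability.
ENSEMBL_PATTERN = re.compile(
    r"^((ENS[FPTG]\d{11}(\.\d+)?)"
    r"|(FB\w{2}\d{7})"
    r"|(Y[A-Z]{2}\d{3}[a-zA-Z](\-[A-Z])?)"
    r"|([A-Z_a-z0-9]+(\.)?(t)?(\d+)?([a-z])?))$"
)


def is_valid_ensembl(identifier: str) -> bool:
    """Return True if the identifier matches the corrected pattern."""
    return ENSEMBL_PATTERN.match(identifier) is not None


if __name__ == "__main__":
    # Versioned and unversioned identifiers taken from this thread.
    for identifier in [
        "ENSG00000049246.14",
        "ENSG00000109819.9",
        "ENSG00000132326.12",
        "ENSG00000179094.16",
        "ENSG00000049246",
    ]:
        print(identifier, is_valid_ensembl(identifier))
```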

cthoyt commented 2 years ago

Since this is a really complicated regular expression, can you give an example of an identifier that wouldn't validate against the old one but would be correct under the new one? While we wait for the Identifiers.org curators to respond, we can directly solve this problem in the Bioregistry (https://bioregistry.io, https://github.com/biopragmatics/bioregistry) so the page for Ensembl (https://bioregistry.io/ensembl) will reflect this and you can use the Bioregistry resolution service as a drop-in replacement for Identifiers.org.

The tentative change is in a pull request at https://github.com/biopragmatics/bioregistry/pull/368. Ideally, we can add a few extra examples for the CI to check :)

luciansmith commented 2 years ago

Sure! All the identifiers below (and above) found in recent biomodels should match the new but not the old:

http://identifiers.org/ensembl/ENSG00000049246.14
http://identifiers.org/ensembl/ENSG00000109819.9
http://identifiers.org/ensembl/ENSG00000132326.12
http://identifiers.org/ensembl/ENSG00000179094.16

The ones I found linked from the other sources are presumably the same:

[ENST00000264867.7]
Bulk tissue gene expression for PER3 (ENSG00000049246.14)
Ensembl version ENSG00000132326.12

cthoyt commented 2 years ago

@luciansmith thanks! It's all set, and will get updated on https://bioregistry.io/ensembl on the nightly redeployment.

cthoyt commented 2 years ago

The following are all working now:

https://bioregistry.io/ensembl:ENSG00000049246.14
https://bioregistry.io/ensembl:ENSG00000109819.9
https://bioregistry.io/ensembl:ENSG00000132326.12
https://bioregistry.io/ensembl:ENSG00000179094.16
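For existing annotations, the rewrite from the Identifiers.org form to the Bioregistry form is a simple string transformation. This is a minimal Python sketch based only on the two URL shapes that appear in this thread (`http://identifiers.org/<prefix>/<id>` and `https://bioregistry.io/<prefix>:<id>`); the function name `to_bioregistry_url` is hypothetical, and other URL variants are not handled.

```python
IDENTIFIERS_ORG_BASE = "http://identifiers.org/"
BIOREGISTRY_BASE = "https://bioregistry.io/"


def to_bioregistry_url(identifiers_org_url: str) -> str:
    """Rewrite http://identifiers.org/<prefix>/<id> as https://bioregistry.io/<prefix>:<id>.

    ASSUMPTION: only the plain http://identifiers.org/<prefix>/<id> shape
    used in this thread is supported; anything else raises ValueError.
    """
    if not identifiers_org_url.startswith(IDENTIFIERS_ORG_BASE):
        raise ValueError(f"not an identifiers.org URL: {identifiers_org_url!r}")
    remainder = identifiers_org_url[len(IDENTIFIERS_ORG_BASE):]
    prefix, sep, local_id = remainder.partition("/")
    if not sep or not prefix or not local_id:
        raise ValueError(f"expected <prefix>/<id> in: {identifiers_org_url!r}")
    return f"{BIOREGISTRY_BASE}{prefix}:{local_id}"


if __name__ == "__main__":
    print(to_bioregistry_url("http://identifiers.org/ensembl/ENSG00000049246.14"))
```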

renatocjn commented 1 year ago

Hello @luciansmith,

I'm a bit confused by the conversation. Could you provide the source of the information that the numbers after the dot are related to the version of the digital artefact? If they are, I propose that we leave the regex as is. That way we block users from publishing links to specific versions. To my mind this is best, since it leaves the decision of which version to show to Ensembl: the newest non-deprecated one, for example.

luciansmith commented 1 year ago

Leaving the regex as-is would be an obvious mistake: if you want to match the bit before the dot only, you'd leave off the entire last bit. Therefore, it seems that the intention of whoever created the original regex was to match the bit after the dot. (I'm not the best at reading regexps, but I think the current regexp would allow http://identifiers.org/ensembl/ENSG00000179094.\d to match, which is clearly incorrect.)
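To make the escaping problem concrete, here is a minimal Python sketch. It ASSUMES the published pattern was the corrected one with every backslash doubled; that is reconstructed from the "\\ replaced with \" remark earlier in this thread, not copied from the registry. Under that assumption, a regex engine reads `\\d` as a literal backslash followed by the letter `d`, so the version group can never match real digits.

```python
import re

# ASSUMPTION: reconstruction of the double-escaped pattern, obtained by
# doubling every backslash in the corrected pattern quoted in this thread.
DOUBLE_ESCAPED_PATTERN = re.compile(
    r"^((ENS[FPTG]\\d{11}(\\.\\d+)?)"
    r"|(FB\\w{2}\\d{7})"
    r"|(Y[A-Z]{2}\\d{3}[a-zA-Z](\\-[A-Z])?)"
    r"|([A-Z_a-z0-9]+(\\.)?(t)?(\\d+)?([a-z])?))$"
)


def matches_double_escaped(identifier: str) -> bool:
    """Return True if the identifier matches the reconstructed pattern."""
    return DOUBLE_ESCAPED_PATTERN.match(identifier) is not None


if __name__ == "__main__":
    for candidate in [
        "ENSG00000179094",        # unversioned identifier
        "ENSG00000179094.16",     # versioned identifier from this thread
        r"ENSG00000179094\.\d",   # string containing literal backslashes
    ]:
        print(candidate, matches_double_escaped(candidate))
```

In this reconstruction the unversioned identifier still passes through the permissive last alternative, the versioned one is rejected, and a string containing literal backslashes is accepted, which is consistent with the pattern having been intended to allow the version suffix.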

Linking to specific versions is, I believe, a choice that should be made on a case by case basis. Sometimes it may be desirable to link to a particular version, and sometimes to the underlying base pattern. I don't think it's the job of identifiers.org to insist that one can't link to a particular version, particularly when there are several biomodels and several other examples in the wild (linked in my initial post) that use that format.

I couldn't find anything explicit that says that the .NN is the version; it just seems to be how it's being used. I would hope it's documented somewhere, but if so I couldn't find it.

renatocjn commented 1 year ago

I guess no one is happy reading regexps, especially one like this. But it is true that the first group seems to be made for that. I will try to solve this. I'm just quite apprehensive about changing the regex for Ensembl since it is such a large repository.

About version IDs, my view is that identifiers.org's objective is to provide permanent IDs, and versioning is quite transient, as a version can be deprecated at any time. While I agree that this is important in specific discussions and development, it should be avoided in cases where a permanent link is necessary, such as academic papers. It would be nice to have some community feedback on this.

I will for now try to fix the escaping in the current version that appears to support this versioning scheme.

renatocjn commented 1 year ago

I have updated the pattern. Please have a look.

I will leave the issue open for a while in case anyone wants to discuss the pattern or if it doesn't match some ID in Ensembl.