identifiers-org / identifiers-org.github.io

Regex for BioSample doesn't validate example #203

Closed: athalhammer closed 2 years ago

athalhammer commented 2 years ago

I'm referring to the regex:

^SAM[NED](\w)?\d+$

in

https://registry.identifiers.org/registry/biosample

that can't match SAMEA2397676 due to the second A (SAME-->A<--2397676) in the LUI

athalhammer commented 2 years ago

I would propose to implement some type of automation pipeline that validates the examples against the provided regular expressions as this seems to be a recurrent issue.

cthoyt commented 2 years ago

Hi @athalhammer, you might have noticed from the issue tracker that the Identifiers.org team isn't really able to respond anymore. I'd suggest checking out the Bioregistry project (https://github.com/biopragmatics/bioregistry and https://bioregistry.io) for something similar that's actively maintained and encourages community feedback.

With respect to your question, I think this works alright on Identifiers.org with https://identifiers.org/biosample:SAMEA2397676 as their page suggests - the A that you're pointing out in your comment seems to get matched to (\w)? which lets you have an optional letter following the SAME before the 2397676.
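A quick way to check this locally is to run the pattern against the example with Python's `re` module. This is only a minimal sketch of the match described above, and the variable names are chosen for illustration:

```python
import re

# Pattern as listed on the Identifiers.org registry page for biosample
pattern = re.compile(r"^SAM[NED](\w)?\d+$")

# Example local unique identifier (LUI) from the same page
match = pattern.match("SAMEA2397676")

# [NED] consumes the "E", the optional (\w)? group consumes the "A",
# and \d+ consumes the remaining digits.
print(match is not None)  # True
print(match.group(1))     # A
```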

This also works fine on the Bioregistry at https://bioregistry.io/biosample:SAMEA2397676

cthoyt commented 2 years ago

Also FYI, the Bioregistry is 100% open source and open data, so it's able to implement much more detailed CI to make sure that exactly this kind of thing stays consistent. For example, the following code makes sure that all example identifiers match the regular expressions for each record:

https://github.com/biopragmatics/bioregistry/blob/ce6abf3b2a893d9072e59a9e1cfd5df8b2d5aa2f/tests/test_data.py#L297-L335
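To give a rough idea of what such a check can look like, here is a minimal sketch. It is not the Bioregistry's actual test code: the `RECORDS` dictionary, its `pattern` and `example` keys, and the `find_mismatches` helper are all hypothetical and only assume that each record carries a regex and one example identifier.

```python
import re

# Hypothetical registry data: prefix -> record with a regex and an example.
# The key names are assumptions made for this sketch.
RECORDS = {
    "biosample": {
        "pattern": r"^SAM[NED](\w)?\d+$",
        "example": "SAMEA2397676",
    },
}


def find_mismatches(records):
    """Return the prefixes whose example does not match their pattern."""
    mismatches = []
    for prefix, record in sorted(records.items()):
        pattern = record.get("pattern")
        example = record.get("example")
        # Skip records that lack either piece; nothing to validate.
        if pattern is None or example is None:
            continue
        if re.fullmatch(pattern, example) is None:
            mismatches.append(prefix)
    return mismatches


if __name__ == "__main__":
    bad = find_mismatches(RECORDS)
    # A CI job could fail the build whenever this list is non-empty.
    print(bad)
```

Wrapping the call in a unit test that asserts the returned list is empty would be enough to fail a CI run whenever an example and its pattern drift apart.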

athalhammer commented 2 years ago

Thanks @cthoyt, you are completely right! I misinterpreted the optional \w character. Thanks also for all the additional pointers!

cthoyt commented 2 years ago

@athalhammer Please feel free to get in touch on the Bioregistry issue tracker or @ me if you find that something's missing from either service! I'm not affiliated with Identifiers.org myself, but I am one of the developers/maintainers of the Bioregistry. We also just put out a preprint last week (https://www.biorxiv.org/content/10.1101/2022.07.08.499378v2) 🚀