identifiers-org / identifiers-org.github.io

Issue in insdc.gca namespace record #232

Closed andrewyatz closed 9 months ago

andrewyatz commented 10 months ago

Hi there

I've noted an issue in the insdc.gca namespace which I believe will cause certain identifiers to be unresolvable. The identifier pattern is currently ^GC[AF]_[0-9]{9}\.[0-9]+$ and permits two types of genome identifier to be resolved:

  1. GCAs - The INSDC accession for a genome (available from ENA and GenBank/NCBI)
  2. GCFs - The RefSeq accession for a genome (available only from RefSeq/NCBI)

The latest release of the human assembly GRCh38.p14 (hg38) has the GCA GCA_000001405.29 and GCF GCF_000001405.40. The first will resolve correctly in identifiers.org (to ENA) but the second will resolve to an invalid URL.

I believe the following could happen to resolve this:

  1. The pattern for insdc.gca is altered to ^GCA_[0-9]{9}\.[0-9]+$
  2. A separate namespace is created for GCFs, e.g. refseq.gcf or ncbi.gcf, with the pattern ^GCF_[0-9]{9}\.[0-9]+$ and a resolver provider going to a URL such as https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40
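
To make the proposal concrete, here is a minimal Python sketch of how the current pattern and the two proposed patterns classify the GRCh38.p14 accessions mentioned above. The `route` helper and the choice of `refseq.gcf` as the new prefix are illustrative assumptions (one of the two names suggested in the list), not how identifiers.org implements resolution.

```python
import re

# Pattern currently registered for insdc.gca (quoted from the issue text).
CURRENT_PATTERN = re.compile(r"^GC[AF]_[0-9]{9}\.[0-9]+$")

# Proposed split: one pattern per accession type.
PROPOSED_GCA = re.compile(r"^GCA_[0-9]{9}\.[0-9]+$")
PROPOSED_GCF = re.compile(r"^GCF_[0-9]{9}\.[0-9]+$")


def route(accession):
    """Return the namespace prefix an accession would belong to after the split.

    Hypothetical helper for illustration only; "refseq.gcf" is an assumed
    prefix taken from the suggestions in the proposal.
    """
    if PROPOSED_GCA.match(accession):
        return "insdc.gca"
    if PROPOSED_GCF.match(accession):
        return "refseq.gcf"
    return None


if __name__ == "__main__":
    for acc in ("GCA_000001405.29", "GCF_000001405.40"):
        accepted_now = CURRENT_PATTERN.match(acc) is not None
        print(acc, "accepted by current pattern:", accepted_now,
              "| namespace after split:", route(acc))
```

Both accessions pass the current pattern, which is why the GCF is accepted by insdc.gca even though it cannot be resolved at ENA; with the split, each accession matches exactly one namespace.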

Any help would be appreciated

renatocjn commented 10 months ago

Hello,

Thank you for bringing this to our attention.

I will have to investigate this a bit. We had issues with ENA before where an entry took a while to become available even though it was already available at NCBI. So this may be just a matter of time until it shows up in ENA. If this is the case, we might make NCBI primary so that all resolutions go there, until we can discuss the issue with the ENA team.

Your proposal could work but we might not want to simply remove GCF resolution from the insdc.gca namespace because others may have used links to those. Changing the pattern would break those references.

For the moment, please use the ncbi provider code to always resolve to ncbi. Your link would be http://identifiers.org/ncbi/insdc.gca:GCF_000001405.40
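
As a rough sketch of the link shape used above, the following Python helper builds a resolution URL with an optional provider code. The function name `resolution_url` is hypothetical, and the URL layout is inferred from the single example link in this comment; the form without a provider code is an assumption.

```python
def resolution_url(prefix, accession, provider_code=None):
    """Build an identifiers.org link of the form shown in the comment above.

    Assumption: the layout is http://identifiers.org/[provider/]prefix:accession,
    inferred from the one example link in this thread.
    """
    base = "http://identifiers.org"
    compact_id = f"{prefix}:{accession}"
    if provider_code:
        # Forcing a provider, e.g. "ncbi", sends resolution to that provider.
        return f"{base}/{provider_code}/{compact_id}"
    return f"{base}/{compact_id}"


if __name__ == "__main__":
    print(resolution_url("insdc.gca", "GCF_000001405.40", provider_code="ncbi"))
```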

I will get back to you on that.

andrewyatz commented 10 months ago

Thanks for the consideration and suggestions around how to mitigate the issue too. Really useful. The main point I do want to make though is GCF accessions are not part of the INSDC collaboration exchange so I do think their inclusion in the record/pattern is a mistake. But considering the above URL example of using an alternative provider code I can also see why that would be an issue too.

Thank you

renatocjn commented 10 months ago

Sorry for the delay. I had a look and spoke with my colleague about it and you are correct. We will be creating a new namespace as you recommended before. Thank you for bringing this up to us.

renatocjn commented 10 months ago

Done. Please have a look at insdc.gca and insdc.gcf and let me know if there is anything else that should be changed.

andrewyatz commented 10 months ago

Thank you for reviewing this and thank you for the progress and changes.

I have a couple of points to make but nothing major:

renatocjn commented 10 months ago

Good points.

renatocjn commented 9 months ago

Hello @andrewyatz, I spoke with Henning and we decided to change the prefix of the new namespace to refseq.gcf. We generally advise against changing prefixes, but since it is unrelated to INSDC and it is a very new namespace, it shouldn't be that big of a deal. It was also my mistake, since I missed the last part of your original message where you proposed the prefix name.

cthoyt commented 9 months ago

FYI, the renaming caused issues in the Bioregistry since we import Identifiers.org on a daily basis.

andrewyatz commented 9 months ago

Many thanks @renatocjn, I appreciate the work you've done here, and apologies @cthoyt for the knock-on effects this has had.

renatocjn commented 9 months ago

Yes. Sorry @cthoyt, it was my fault for not paying enough attention when reading his original message. Hopefully, it wasn't too difficult of an issue.