nhoffman / ya16sdb

A curated subset of 16S rRNA sequences from NCBI

regex bug for long genome accessions #37

Closed dhoogest closed 4 years ago

dhoogest commented 4 years ago

For example, NZ_CAADIT010000001 (from accession CAADIT010000001) is transformed by the regular expression at https://github.com/nhoffman/ya16sdb/blob/f429540104b6c151acd65a968a51446aa33fcd63/bin/extract_genbank.py#L26 into ADIT01000000, which results in duplicate records for this genome.
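A minimal sketch of how this kind of truncation can happen, and one way to avoid it. The `OLD_STYLE` pattern below is a hypothetical stand-in, not the expression from `extract_genbank.py`; it only assumes an unanchored search for the older WGS shape (4 letters plus 8 digits). The `WGS` pattern and the `genbank_accession` helper are likewise illustrative names, and the 4-letter and 6-letter prefix shapes are a simplification of the NCBI formats.

```python
import re

# Hypothetical pattern, NOT copied from extract_genbank.py: an unanchored
# search for the older WGS shape (4 letters followed by 8 digits).
OLD_STYLE = re.compile(r'[A-Z]{4}\d{8}')

# Sketch of a stricter pattern (an assumption, not the project's fix):
# an optional two-letter RefSeq prefix such as "NZ_", then a 4-letter or
# 6-letter WGS prefix followed by at least 8 digits.
WGS = re.compile(r'(?:[A-Z]{2}_)?((?:[A-Z]{4}|[A-Z]{6})\d{8,})')


def genbank_accession(accession):
    """Return the accession without its RefSeq prefix, or None if the
    whole string does not look like a WGS accession."""
    match = WGS.fullmatch(accession)
    return match.group(1) if match else None


# The unanchored search starts in the middle of the 6-letter prefix and
# stops after 8 digits, reproducing the value reported in this issue.
print(OLD_STYLE.search('NZ_CAADIT010000001').group(0))  # ADIT01000000

# Matching the whole string keeps the full accession.
print(genbank_accession('NZ_CAADIT010000001'))  # CAADIT010000001
print(genbank_accession('CAADIT010000001'))     # CAADIT010000001
```

Using `fullmatch` instead of `search` means an unexpected format yields `None` rather than a silently shortened accession, which makes new accession formats easier to notice.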

crosenth commented 4 years ago

https://ncbiinsights.ncbi.nlm.nih.gov/2019/03/22/new-accession-formats-refseq-release-93/

crosenth commented 4 years ago

For record keeping, these are the current formats:

Screen Shot 2019-08-08 at 10 12 37 AM

https://www.ncbi.nlm.nih.gov/Sequin/acc.html

Screen Shot 2019-08-08 at 10 19 27 AM

https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/?report=objectonly

Also keep in mind that generated RefSeq accessions do not follow the same rules as regular accessions. For example:

Screen Shot 2019-08-08 at 10 29 34 AM

The accession in the screenshot above has a RefSeq prefix followed only by numerals.
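As a rough illustration of that distinction, the sketch below separates RefSeq accessions whose prefix is followed only by digits from those that wrap a WGS accession. The function name `refseq_kind`, the two patterns, and the sample accessions are assumptions made for this example; the patterns are simplified and do not cover every format in the NCBI tables linked above.

```python
import re

# Simplified, assumed shapes (not an exhaustive list of NCBI formats):
# RefSeq prefix followed only by digits, e.g. NC_000913
REFSEQ_NUMERIC = re.compile(r'[A-Z]{2}_\d+')
# RefSeq prefix wrapping a WGS accession with a 4- or 6-letter prefix
REFSEQ_WGS = re.compile(r'[A-Z]{2}_(?:[A-Z]{4}|[A-Z]{6})\d{8,}')


def refseq_kind(accession):
    """Classify an accession as 'numeric', 'wgs', or 'other'."""
    # Drop a trailing version suffix such as ".1" before matching.
    base = accession.split('.', 1)[0]
    if REFSEQ_NUMERIC.fullmatch(base):
        return 'numeric'
    if REFSEQ_WGS.fullmatch(base):
        return 'wgs'
    return 'other'


print(refseq_kind('NC_000913.3'))           # numeric
print(refseq_kind('NZ_CAADIT010000001.1'))  # wgs
print(refseq_kind('CAADIT010000001'))       # other
```

Only the 'wgs' case contains a GenBank accession that can be recovered by removing the prefix; a numeric RefSeq accession has no such embedded accession.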

crosenth commented 4 years ago

https://github.com/nhoffman/ya16sdb/commit/e26350798482c8b8535e011d2242230430047e0a