ncbi / datasets

NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases.
https://www.ncbi.nlm.nih.gov/datasets

Genes without locus tag in GCF_030052815.1 #397

Open manulera opened 3 weeks ago

manulera commented 3 weeks ago

Hello,

I have two questions. I'm not sure whether this is a general NCBI issue or something specific to the datasets API, so I'm happy to forward the query elsewhere.

I came across this problem recently with the genome of Hevea brasiliensis (taxid 3981), reference genome assembly GCF_030052815.1.

I thought that locus_tags were required for genomes to be deposited in, or queried from, NCBI. However, the genes in the nuclear genome of this assembly do not seem to have locus_tags:

https://ncbi.nlm.nih.gov/datasets/gene/GCF_030052815.1/?search=rubber

Question 1: is it to be expected that locus_tags are missing, or is it an issue with this assembly in particular?

I looked at the RefSeq (https://www.ncbi.nlm.nih.gov/nuccore/NC_079493.1/) and GenBank (https://www.ncbi.nlm.nih.gov/nuccore/CM057502.1?report=genbank&log$=seqview) records. Below is an example of the same CDS in both records:

NC_079493

     CDS             complement(642191..642643)
                     /gene="LOC110662440"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /codon_start=1
                     /product="ferredoxin, root R-B2"
                     /protein_id="XP_021677096.2"
                     /db_xref="GeneID:110662440"
                     /translation="MATVTVPSQCMVKIAPKNQFASTIIKNPCSLGSVRSISKSFRLK
                     CSQNFKASMAVYKIKLIGPEGEEQEFDAADDTYILDAAENAGVELPYSCRAGACSTCA
                     GKMVSGSVDQSDGSFLDETQMKEGYLLTCISYPTSDCVIYTHQESELC"

CM057502

     CDS             complement(642191..642643)
                     /locus_tag="P3X46_000044"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="KAJ9188672.1"
                     /translation="MATVTVPSQCMVKIAPKNQFASTIIKNPCSLGSVRSISKSFRLK
                     CSQNFKASMAVYKIKLIGPEGEEQEFDAADDTYILDAAENAGVELPYSCRAGACSTCA
                     GKMVSGSVDQSDGSFLDETQMKEGYLLTCISYPTSDCVIYTHQESELC"

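The location and translation are identical in both records, but only the GenBank record (CM057502) has a locus_tag; the RefSeq record (NC_079493) identifies the gene through /gene and a GeneID db_xref instead. The /product and /protein_id values also differ. Beyond this example, there are features that exist in one record but not in the other.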
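To make the comparison reproducible, here is a minimal Python sketch that parses the two CDS excerpts quoted above and lists which qualifiers are unique to each record and which differ. The helper names (`parse_feature`, `compare_qualifiers`) are my own and not part of any NCBI tool; the parser is deliberately naive (it assumes each qualifier appears at most once and that no continuation line starts with `/`), so it is only meant for small excerpts like these, not for whole records.

```python
import re

# CDS excerpts copied from the two records quoted in this issue.
REFSEQ_CDS = '''
     CDS             complement(642191..642643)
                     /gene="LOC110662440"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Gnomon."
                     /codon_start=1
                     /product="ferredoxin, root R-B2"
                     /protein_id="XP_021677096.2"
                     /db_xref="GeneID:110662440"
                     /translation="MATVTVPSQCMVKIAPKNQFASTIIKNPCSLGSVRSISKSFRLK
                     CSQNFKASMAVYKIKLIGPEGEEQEFDAADDTYILDAAENAGVELPYSCRAGACSTCA
                     GKMVSGSVDQSDGSFLDETQMKEGYLLTCISYPTSDCVIYTHQESELC"
'''

GENBANK_CDS = '''
     CDS             complement(642191..642643)
                     /locus_tag="P3X46_000044"
                     /codon_start=1
                     /product="hypothetical protein"
                     /protein_id="KAJ9188672.1"
                     /translation="MATVTVPSQCMVKIAPKNQFASTIIKNPCSLGSVRSISKSFRLK
                     CSQNFKASMAVYKIKLIGPEGEEQEFDAADDTYILDAAENAGVELPYSCRAGACSTCA
                     GKMVSGSVDQSDGSFLDETQMKEGYLLTCISYPTSDCVIYTHQESELC"
'''


def parse_feature(text):
    """Parse one flat-file feature into (key, location, qualifiers).

    Naive by design: one value per qualifier name, and any line that
    does not start with '/name=' is treated as a continuation line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    key, location = lines[0].split(None, 1)
    qualifiers = {}
    name = None
    for ln in lines[1:]:
        m = re.match(r'/(\w+)=(.*)', ln)
        if m:
            name = m.group(1)
            qualifiers[name] = m.group(2)
        elif name is not None:
            # Sequence lines are joined directly; free text with a space.
            sep = "" if name == "translation" else " "
            qualifiers[name] += sep + ln
    qualifiers = {k: v.strip('"') for k, v in qualifiers.items()}
    return key, location, qualifiers


def compare_qualifiers(a, b):
    """Return (names only in a, names only in b, shared names with different values)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    differing = sorted(k for k in set(a) & set(b) if a[k] != b[k])
    return only_a, only_b, differing


_, refseq_loc, refseq_q = parse_feature(REFSEQ_CDS)
_, genbank_loc, genbank_q = parse_feature(GENBANK_CDS)
only_refseq, only_genbank, differing = compare_qualifiers(refseq_q, genbank_q)

print("same location:   ", refseq_loc == genbank_loc)
print("same translation:", refseq_q["translation"] == genbank_q["translation"])
print("only in RefSeq:  ", only_refseq)
print("only in GenBank: ", only_genbank)
print("different values:", differing)
```

For full records, a proper flat-file parser would be a better choice than this regex-based sketch, since it handles repeated qualifiers and multi-part locations.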

Question 2: is it common that the annotations in GenBank and RefSeq records differ?

Thank you so much for your help!

Best, Manu