ropensci / rentrez

talk with NCBI entrez using R
https://docs.ropensci.org/rentrez

`entrez_fetch` returns multiple records from the `nuccore` database when given one ID #172

Open ajrominger opened 2 years ago

ajrominger commented 2 years ago

Thanks for making this package! It’s a huge help. After a lot of successful use, I’m finding some very strange behavior. Basically, when I try to fetch nucleotide data for one ID, I get many records back. Any help understanding what's going on, whether there's a bug, or whether I'm doing something wrong would be much appreciated!

Here’s an example:

# get record 
raw <- rentrez::entrez_fetch(db = 'nuccore', id = '403062956',
                             rettype = 'native', retmode = 'xml',
                             parsed = TRUE)
## No encoding supplied: defaulting to UTF-8.
# transform into a named vector (XML structure preserved in the names;
# there's probably a much better way to do this)
rawList <- unlist(XML::xmlToList(raw))

The ID should produce a result for the species Drosophila murphyi. But the query in fact produces many records, most of which are not for Drosophila murphyi:

# find all the times a species name is mentioned
allTax <- rawList[grep('Org-ref_taxname', names(rawList))]
names(allTax) <- NULL

# there are a lot of times
length(allTax)
## [1] 212
# most are not for the expected species
sum(allTax == 'Drosophila murphyi')
## [1] 3
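
Since the snippet above notes that there is probably a better way than flattening the whole document with `unlist()`, here is a minimal sketch of one alternative: querying the parsed XML directly with XPath. It assumes the object returned by `entrez_fetch(..., parsed = TRUE)` can be queried with the `XML` package, and it uses a small invented stand-in document (`toy_xml`, including the second species name) rather than a real NCBI response, so the nesting shown is illustrative only.

```r
library(XML)

# Toy stand-in for the parsed entrez_fetch() result. The nesting and the
# second species name are invented for illustration, not real NCBI output.
toy_xml <- '<Bioseq-set>
  <Seq-entry><Org-ref><Org-ref_taxname>Drosophila murphyi</Org-ref_taxname></Org-ref></Seq-entry>
  <Seq-entry><Org-ref><Org-ref_taxname>Drosophila melanogaster</Org-ref_taxname></Org-ref></Seq-entry>
  <Seq-entry><Org-ref><Org-ref_taxname>Drosophila murphyi</Org-ref_taxname></Org-ref></Seq-entry>
</Bioseq-set>'
doc <- XML::xmlParse(toy_xml, asText = TRUE)

# Pull the text of every Org-ref_taxname element, wherever it sits in the tree
tax_names <- XML::xpathSApply(doc, "//Org-ref_taxname", XML::xmlValue)

# Tabulate how often each species name occurs
tax_counts <- table(tax_names)
print(tax_counts)
```

With the real `raw` object in place of `doc`, the same `xpathSApply()` call should give the values that the `grep()` on `names(rawList)` collects, without relying on how `unlist()` builds the names.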

If I ask for the FASTA file it behaves as expected and gives me one record for the right species:

rawFASTA <- rentrez::entrez_fetch(db = 'nuccore', id = '403062956',
                                  rettype = 'fasta')
## No encoding supplied: defaulting to UTF-8.
cat(rawFASTA)
## >JN815406.1 Drosophila murphyi voucher M09059 elongation factor 1 gamma (EF-1g) gene, partial cds
## CAAATGTCTGACCGAGTCGAATGCCATTGCCTACTTTTTGGCCAATGAGCAGCTGCGTGGCGGCAAATGT
## CCGCTGGTGCAGGCTCAGGTGCAGCAATGGATCTCATTCGCTGACAATGAAATCTTGCCTGCGTCCTGCG
## CATGGGTGTTCCCACTGCTCGGCATAATGCCGCAGCAGAAGAATGCGAATGTGAAACGGGACGTTGAGGT
## TGTGCTGCAGCAGCTGAACAAGAAGCTGTTGGATGCCACTTACCTCGCCGGTGAACGCATCACGTTGGCC
## GACATTGTTGTCTTCTGCACCCTGCTCCATTTGTATGAGCATGTRCTGGATTCAAGTGCACGCAGTGCGT
## ACGGCAATCTGAACCGTTGGTTCGTCACCATCCTCAATCAGCCGCAGGTGAAGGCTGTTGTCAAGGACTT
## TAAGCTGTGCGAAAAGGCGCTCGTCTTTGATCCCAAGAAGTACGCCGAATTCCTGGCCAAGACTGGCGGT
## GCCAAGCCCCAGCAGGCGCCCAAGTCCAAGGATGAGAAAAAGGCCAAGAAGGAAGCGGCACCCGCACCCG
## AAMCCGAGGAGCTCGATGCTGCCGATGCCGCKTTGGCTATGGAGCCCAAGTCCAAGGATCCGTTTGATGC
## CATGCCCAAGGGCACGTTCAATTTCGATGACTTCAAGCGTGTCTATTCCAATGAGGAAGAGGCCAAGTCC
## ATTCCCTATTTCTTTGAGAAATTCGATGCCGAGAACTATTCGATCTGGTTTGGCGAATACAAATACAACG
## AAGAACTGACCAAGACTTTCATGTCCTGCAATCTGATCGGTGGCATG

dwinter commented 2 years ago

Hi @ajrominger,

Well that's pretty odd...

Just replying now to let you know this is the first week of a new position for me, so it'll be a little while before the dust settles and I'm able to get to stuff like this. Will take a look at it when I get a moment and let you know if I can turn anything up.

ajrominger commented 2 years ago

Great, thanks @dwinter! As one more piece of information, it seems all the returned records are part of a PopSet, in this case PopSet 403062842.
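
If the native XML really does come back as the whole set, one possible workaround is to keep only the sequence entry whose ID matches the one requested. The sketch below is a guess at how that could look, not a confirmed fix: it assumes each sequence sits in a `Bioseq` element that carries its GI in a `Seq-id_gi` element, and it runs on an invented miniature document (`toy_set`, with a made-up second GI and species) rather than the real response.

```r
library(XML)

target_gi <- "403062956"

# Invented miniature of a set holding two sequences. The second GI, the
# second species, and the exact nesting are assumptions for illustration.
toy_set <- '<Bioseq-set>
  <Bioseq>
    <Bioseq_id><Seq-id><Seq-id_gi>403062956</Seq-id_gi></Seq-id></Bioseq_id>
    <Org-ref><Org-ref_taxname>Drosophila murphyi</Org-ref_taxname></Org-ref>
  </Bioseq>
  <Bioseq>
    <Bioseq_id><Seq-id><Seq-id_gi>111111111</Seq-id_gi></Seq-id></Bioseq_id>
    <Org-ref><Org-ref_taxname>Drosophila sp.</Org-ref_taxname></Org-ref>
  </Bioseq>
</Bioseq-set>'
doc <- XML::xmlParse(toy_set, asText = TRUE)

# Select only the Bioseq element(s) that contain the requested GI
xpath <- sprintf("//Bioseq[.//Seq-id_gi = '%s']", target_gi)
hits <- XML::getNodeSet(doc, xpath)

# Read the species name from inside the matching entry only
# (".//" keeps the search relative to that node)
target_tax <- XML::xpathSApply(hits[[1]], ".//Org-ref_taxname", XML::xmlValue)
print(target_tax)
```

Whether the real records are laid out this way would need checking against the actual XML; if the GI lives elsewhere, only the XPath string should need to change.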

Congrats and good luck on the new position!