Discrepancy between number of sequences retrieved via web_history and number of sequence IDs

mbarkdull commented 3 years ago

Hi folks,

I'm having an issue with using rentrez to download gene sequences from NCBI. I know that my species has 17,121 sequences of interest, and indeed when I use entrez_search, it returns 17,121 sequence IDs. However, when I proceed through the workflow to download the sequences, I wind up with only 10,000 sequences in the output of entrez_fetch.

My code is as follows:

# Get gene ids:
mphaGeneIDs <- entrez_search(db = "gene", term = "Monomorium pharaonis[organism] AND (alive[prop])", use_history = TRUE, retmax = 30000)

# Link those gene IDs to the nucleotide sequences:
mphaIDLinks <- entrez_link(dbfrom = "gene", web_history = mphaGeneIDs$web_history, db = "nuccore", cmd = "neighbor_history")

# Pull down those nucleotide sequences:
mphaCDS <- entrez_fetch(db = "nuccore", web_history = mphaIDLinks$web_histories$gene_nuccore, rettype = "fasta")

# Export the sequences:
write(mphaCDS, file = "mpha_cds.fasta")

Any help would be hugely appreciated! -Megan

dwinter commented 3 years ago

Hi @mbarkdull ,

I'm not sure if there is an upper-limit to the number of sequences you can download in one "fetch" (the round number makes it seem like maybe so). But the best approach for very large requests like this is to do them in batches (also a good idea because very large downloads are often interrupted mid-stream).

In this case, I can get sequences 10,001-10,051 using restart and retmax:

past_10k <- entrez_fetch(db = "nuccore", 
                                         web_history = mphaIDLinks$web_histories$gene_nuccore, 
                                         rettype = "fasta", retstart = 1e4+1, retmax=50)
tf <- tempfile()
cat(past_10k, file=tf)
ape::read.dna(tf, format="fasta")

50 DNA sequences in binary format stored in a list.

Mean sequence length: 5855.74 
   Shortest sequence: 380 
    Longest sequence: 25416 

Labels:
XM_036286492.1 PREDICTED: Monomorium pharaonis eukaryotic tr...
XM_036286491.1 PREDICTED: Monomorium pharaonis eukaryotic tr...
XM_036286489.1 PREDICTED: Monomorium pharaonis eukaryotic tr...
XM_036286488.1 PREDICTED: Monomorium pharaonis eukaryotic tr...
XM_012683651.3 PREDICTED: Monomorium pharaonis astakine (LOC...
XM_028190656.2 PREDICTED: Monomorium pharaonis astakine (LOC...
...

Base composition:
    a     c     g     t 
0.303 0.194 0.212 0.291 
(Total: 292.79 kb)

The vignette has an example using these in a for loop in the web history section

dwinter commented 3 years ago

Looks like there is just a limit on the number of records you can get in one call to fetch. Closing the issue now, @mbarkdull , let me know if you have any difficulties getting the sequences you are after.

ropensci / rentrez

Discrepancy between number of sequences retrieved via web_history and number of sequence IDs #160