kblin / ncbi-acc-download

Download files from NCBI Entrez by accession
Apache License 2.0

Entrez search support? #8

Open peterjc opened 6 years ago

peterjc commented 6 years ago

This is outside the current scope of the tool, but would you consider adding NCBI Entrez search support as an alternative to supplying the accessions directly?

e.g. https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=opuntia%5BORGN%5D+accD&retmax=10&idtype=acc

(In Biopython, handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", idtype="acc") or similar)
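For anyone who wants to try this without Biopython, here is a minimal standard-library sketch that builds the same ESearch URL and pulls the accessions out of the XML reply. The function names and the `SAMPLE_REPLY` text are made up for illustration (the sample is hand-written in the shape of an ESearch reply, not captured from NCBI), and no network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", retmax=10):
    """Return an ESearch URL that asks for accessions (idtype=acc)."""
    params = {"db": db, "term": term, "retmax": retmax, "idtype": "acc"}
    # urlencode escapes the brackets and turns the space into '+'
    return ESEARCH_URL + "?" + urlencode(params)


def parse_accessions(xml_text):
    """Collect the text of every <Id> element under <IdList>."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Hand-written example reply (an assumption for illustration only).
SAMPLE_REPLY = """<eSearchResult>
  <Count>3</Count>
  <IdList>
    <Id>EF590893.1</Id>
    <Id>EF590892.1</Id>
    <Id>HQ620723.1</Id>
  </IdList>
</eSearchResult>"""

print(build_esearch_url("opuntia[ORGN] accD"))
print(parse_accessions(SAMPLE_REPLY))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_accessions` would be the obvious next step.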

This currently gives three accessions, EF590893.1, EF590892.1, HQ620723.1, which I can download with:

$ ncbi-acc-download EF590893.1 EF590892.1 HQ620723.1

I would like to be able to do something like this to achieve the same result:

$ ncbi-acc-download -search "opuntia[ORGN] accD" -retmax 10
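To make the request concrete, here is a rough sketch of how such an interface might be wired up as a separate wrapper. Everything in it is hypothetical: the program name `acc-search-sketch`, the `--search` and `--retmax` options, and both helper functions are my own placeholders, not existing ncbi-acc-download options. The only thing taken from the real tool is that it accepts accessions as positional arguments.

```python
import argparse


def parse_args(argv):
    """Parse a hypothetical command line that takes accessions OR a search."""
    parser = argparse.ArgumentParser(prog="acc-search-sketch")
    parser.add_argument("accessions", nargs="*", help="accessions to download")
    parser.add_argument("--search", help="Entrez query to resolve to accessions")
    parser.add_argument("--retmax", type=int, default=10,
                        help="maximum number of search hits to use")
    args = parser.parse_args(argv)
    # Exactly one input mode must be chosen.
    if bool(args.search) == bool(args.accessions):
        parser.error("give either accessions or --search, not both or neither")
    return args


def build_download_command(accessions):
    """Build the argument list for the existing positional-accession call."""
    return ["ncbi-acc-download", *accessions]


args = parse_args(["--search", "opuntia[ORGN] accD", "--retmax", "10"])
print(args.search, args.retmax)
print(build_download_command(["EF590893.1", "EF590892.1", "HQ620723.1"]))
```

The search would be resolved to accessions first, and the resulting list handed to the existing download path (here via the command list, which could be passed to `subprocess.run`).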
peterjc commented 6 years ago

Test example,

$ conda install entrez-direct
$ esearch -db nucleotide -query "its1 AND Phytophthora[Organism] AND 150:800[Sequence Length]" \
 | efetch -format fasta > ncbi_sample.fasta
$ grep -c "^>" ncbi_sample.fasta
2246
$ grep "^>" ncbi_sample.fasta | head
>MG255148.1 Phytophthora palmivora isolate TARI p98158 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence
>LC159493.1 Phytophthora drechsleri genes for ITS1, 5.8S rRNA, ITS2, partial and complete sequence, isolate: PhWa20140918-2
>LC159492.1 Phytophthora drechsleri genes for ITS1, 5.8S rRNA, ITS2, 28S rRNA, partial and complete sequence, isolate: PhWa20140918-1
>LS479897.1 Phytophthora capsici genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain LL2480
>LS479193.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain XD15
>LS479173.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain 80029
>LS479172.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain 88069
>LS479171.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain XA-4
>LS479169.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain DN111
>LS479127.1 Phytophthora infestans genomic DNA sequence contains 18S rRNA gene, ITS1, 5.8S rRNA gene, ITS2, 28S rRNA gene, strain XD1314
kblin commented 5 years ago

Just realized I forgot to comment on this, sorry about that. I'm not sure that this is the direction I want to go with the tool. I'll have to think about this a bit more.

peterjc commented 5 years ago

That's fine - I appreciate this is a shift in focus.

I can easily do what I want with entrez-direct, but it is not reliable. For example, at busy times it frequently gives partial downloads (not all the records) and, judging by my continuous integration results, does not return a non-zero exit code when this happens.

I thought expanding your tool made sense because of its existing sanity checking (number of records returned, basic formatting, etc).
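As an illustration of the kind of check I mean, here is a minimal sketch that counts FASTA records and fails loudly when the number differs from what the search reported. The function names are my own, and this is not how ncbi-acc-download implements its checks. It assumes the expected count is already known, for instance from the ESearch result.

```python
def count_fasta_records(lines):
    """Count FASTA header lines, i.e. lines starting with '>'."""
    return sum(1 for line in lines if line.startswith(">"))


def check_complete(lines, expected):
    """Raise if the number of FASTA records differs from the expected count."""
    found = count_fasta_records(lines)
    if found != expected:
        raise RuntimeError(f"expected {expected} records, got {found}")
    return found


# Tiny in-memory example; a real run would iterate over the downloaded file.
example = [">EF590893.1 example", "ACGT", ">EF590892.1 example", "TTGA"]
print(check_complete(example, 2))
```

In a CI job the `RuntimeError` (or an explicit `sys.exit(1)`) would turn a partial download into a visible failure instead of a silent one.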