kblin / ncbi-acc-download

Download files from NCBI Entrez by accession
Apache License 2.0

Add genomic range and ORF 'correction' option #22

Closed zdk123 closed 3 years ago

zdk123 commented 3 years ago

This addresses #19 by adding an option to restrict an accession download to a genomic range (i.e. the from and to parameters in the request query string).

For large records, this saves a substantial amount of time and bandwidth compared to downloading the whole thing and then subsetting.

Example usage:

ncbi-acc-download NC_007194 --range 1001:9000
ncbi-acc-download NC_007194 -g 1001:9000
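A minimal sketch of how a `start:end` value could be turned into those query parameters. The names `parse_range` and `range_params` are illustrative and not necessarily the functions in this PR, and the check for 1-based, ascending coordinates is an assumption:

```python
from typing import Dict, Tuple


def parse_range(text: str) -> Tuple[int, int]:
    """Parse a 'start:end' string such as '1001:9000' into two ints."""
    start_str, sep, end_str = text.partition(":")
    if not sep:
        raise ValueError(f"expected 'start:end', got {text!r}")
    # int() raises ValueError on empty or non-numeric parts.
    start, end = int(start_str), int(end_str)
    # Assumption: coordinates are 1-based and start must not exceed end.
    if start < 1 or end < start:
        raise ValueError(f"invalid range {text!r}")
    return start, end


def range_params(text: str) -> Dict[str, str]:
    """Build the extra query-string parameters for a genomic range."""
    start, end = parse_range(text)
    # 'from' and 'to' are the parameter names mentioned in the description.
    return {"from": str(start), "to": str(end)}
```

These parameters would then be merged into the existing request parameters before the download URL is built.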

Combining multiple accessions with a genomic range triggers an error:

ncbi-acc-download NC_007194 NC_007195 --range 1001:9000
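The reasoning is that a coordinate pair only makes sense relative to a single record. A guard along these lines would enforce that; the function name and the error message are hypothetical, not taken from the PR:

```python
from typing import Optional, Sequence


def check_range_usage(accessions: Sequence[str], range_text: Optional[str]) -> None:
    """Reject a genomic range when more than one accession is requested."""
    # Hypothetical check: a range is only allowed alongside one accession.
    if range_text is not None and len(accessions) > 1:
        raise ValueError(
            "a genomic range can only be combined with a single accession"
        )
```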

Of course, if you pick arbitrary coordinates like this, a range boundary will sometimes fall in the middle of an ORF. NCBI won't complain, but certain downstream applications I've run into don't like this. I've therefore also added a correct option to the --extended-validation flag that filters these ORFs out. There's also a new unit test for the correction validator (note that correct is not run when all is specified).

ncbi-acc-download NC_007194 -g 1001:9000 -e correct
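As a rough illustration of the idea (not the validator code in this PR), the filter can be thought of as dropping CDS features whose location is open-ended. This sketch assumes that features cut by the requested range are reported with partial `<` / `>` boundaries, and it works on plain GenBank location strings rather than parsed records; `is_partial` and `drop_partial_cds` are made-up names:

```python
from typing import Iterable, List, Tuple

Feature = Tuple[str, str]  # (feature type, GenBank location string)


def is_partial(location: str) -> bool:
    """True if a GenBank location string has an open-ended boundary."""
    # In the feature table format, '<' and '>' mark positions that extend
    # beyond the sequence shown, e.g. '<1..206' or 'complement(7500..>8000)'.
    return "<" in location or ">" in location


def drop_partial_cds(features: Iterable[Feature]) -> List[Feature]:
    """Keep every feature except CDS entries with a partial location."""
    return [
        (kind, loc)
        for kind, loc in features
        if not (kind == "CDS" and is_partial(loc))
    ]
```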

kblin commented 3 years ago

Ah, crud, I should have pinned biopython. 1.79 deprecating UnknownSeq strikes again. I'll fix this.

kblin commented 3 years ago

Weird, I can't seem to rebase this PR on the current master branch. Maybe you can update it to be based on current master? That'll fix the biopython-related test failure.

zdk123 commented 3 years ago

I'll do that - I originally forked from DarianHole/ncbi-acc-download so that might explain it

zdk123 commented 3 years ago

@kblin rebased

kblin commented 3 years ago

Awesome, thanks. Apart from the wrong version number bump, things look good to me, thanks for the contribution! I'll fix the version number and cut a new release.

zdk123 commented 3 years ago

thanks - I can contribute the usage code over at secmet/mibig-json as well