pirovc / genome_updater

Bash script to download/update snapshots of files from NCBI genomes repository (refseq/genbank) with track of changes and without redundancy
MIT License

Different number of entries in different query #56

Closed GreyGuoweiChen closed 2 years ago

GreyGuoweiChen commented 2 years ago

Hello @pirovc, when I download the bacterial proteins from RefSeq using genome_updater, the number of entries differs between runs of the same query. Do you have any idea why this happens? I use a command like: genome_updater.sh -c "representative genome" -g "bacteria" -d "refseq" -f "protein.faa.gz" -o "testrefseq" -t 32 -m -k

And this occurred to me: 1:

2:

Any suggestions would help.

pirovc commented 2 years ago

I think there's an issue when downloading the assembly_summary.txt, caused either by an unstable connection or by something on the NCBI servers. It happened to me lately as well. I'm implementing a file checker, to be released in the next version, to hopefully avoid this issue. I'll update this thread when it is available.

For now, an alternative is to manually download the bacterial refseq assembly_summary.txt, make sure it's complete, and use it as an external input in genome_updater:

./genome_updater.sh -e assembly_summary.txt -d "refseq" -c "representative genome" -f "protein.faa.gz" -o "test_refseq" -t 32 -m -k
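The "make sure it's complete" step above could be sketched roughly as follows. This is only a heuristic, not the file checker implemented in genome_updater: the function name `check_summary` is hypothetical, and it assumes that a truncated download shows up as a missing final newline or as a row with a different number of tab-separated fields than the other rows.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of genome_updater): heuristic completeness
# check for a downloaded assembly_summary.txt.
check_summary() {
    local file="$1"

    # The file must exist and be non-empty.
    [ -s "$file" ] || return 1

    # A download cut off mid-line does not end with a newline character.
    [ "$(tail -c 1 "$file" | wc -l)" -eq 1 ] || return 1

    # Every non-comment row must have the same number of tab-separated
    # fields as the first data row; there must be at least one data row.
    awk -F'\t' '
        /^#/    { next }
        n == 0  { n = NF }
        NF != n { bad = 1; exit }
        END     { if (bad || n == 0) exit 1 }
    ' "$file"
}

# Usage sketch (needs network access; URL is the NCBI RefSeq bacteria summary):
#   url="https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt"
#   curl -fsSL --retry 3 -o assembly_summary.txt "$url" \
#     && check_summary assembly_summary.txt \
#     && echo "assembly_summary.txt looks complete"
```

A file cut off exactly at a line boundary would still pass this check, so comparing the line count of two independent downloads is a reasonable extra safeguard.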

pirovc commented 2 years ago

There were some improvements implemented in the new version (v0.5.0) to solve this problem. Please give it a try and re-open this issue if the problem persists.