Closed simonleandergrimm closed 7 months ago
Successfully ran the entire script throughout the night, though the combined_genomes.fna file seems too short?
[ec2-user@ip-172-31-32-232 bowtie]$ wc -l detailed-taxids.txt
127639 detailed-taxids.txt
[ec2-user@ip-172-31-32-232 bowtie]$ wc -l ../human-viruses.tsv
28017 ../human-viruses.tsv
[ec2-user@ip-172-31-32-232 bowtie]$ grep -o '>' combined_genomes.fna | wc -l
41925
[ec2-user@ip-172-31-32-232 bowtie]$
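For anyone repeating this check: `grep -o '>'` counts every `>` character in the file, so a header whose description happens to contain `>` would be counted more than once. A minimal Python sketch that counts only lines starting with `>` follows; the function name `count_fasta_records` is made up for illustration and is not part of the PR.

```python
def count_fasta_records(path):
    """Count FASTA records by counting lines that begin with '>'."""
    count = 0
    with open(path, "rt") as handle:
        for line in handle:
            # A FASTA header is a line whose first character is '>'.
            if line.startswith(">"):
                count += 1
    return count
```

Calling `count_fasta_records("combined_genomes.fna")` from the bowtie directory should give a number comparable to the grep result above, and no higher than it.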
That looks right to me? I ran this a while ago, and got 39445 genomes.
The number of human viruses isn't that high, and GenBank still has some filter on addition.
OK, good to know. Will merge.
Hey @jeffkaufman.
Here is the complete build_bowtie2_db.py. I marked this PR as a draft pull request. Some things to consider:
- I ran gimme-taxa.py and ncbi-genome-download on the first ten viruses in human-viruses.tsv. This took a long time, so I stopped the process and skipped to the next step of the script: bowtie2-build.
- bowtie2-build worked successfully on a smaller subset of .fna files and created .bt2 files.
I will let build_bowtie2_db.py run overnight and see how far it gets during that time. I will report back on how that goes. In the meantime, feel free to review the code and the README, and assign the PR back to me once you have found time to review it.
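As a rough illustration of the step between the genome download and bowtie2-build, here is a sketch of how downloaded FASTA files might be merged into a single reference. It is not taken from build_bowtie2_db.py. The function names `combine_fasta` and `bowtie2_build_command` are hypothetical, and the sketch assumes the downloads are `.fna` or gzip-compressed `.fna.gz` files that each end with a newline.

```python
import gzip
import shutil
from pathlib import Path


def combine_fasta(input_dir, output_path):
    """Concatenate every .fna / .fna.gz file under input_dir into one FASTA.

    Assumes each input file ends with a newline, so records do not run together.
    Returns the number of input files that were merged.
    """
    input_dir = Path(input_dir)
    output_path = Path(output_path)
    # Collect inputs before opening the output, and never read the output itself.
    sources = sorted(
        p for p in input_dir.rglob("*")
        if p.is_file()
        and p.name.endswith((".fna", ".fna.gz"))
        and p.resolve() != output_path.resolve()
    )
    with open(output_path, "wb") as out:
        for src in sources:
            # Decompress .gz inputs on the fly; copy plain files as-is.
            opener = gzip.open if src.name.endswith(".gz") else open
            with opener(src, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return len(sources)


def bowtie2_build_command(fasta_path, index_base):
    """Return the argument list for bowtie2-build: reference FASTA, then index basename."""
    return ["bowtie2-build", str(fasta_path), str(index_base)]
```

The command list can be passed to `subprocess.run(..., check=True)` once the combined file exists. Keeping the command construction separate from its execution makes it easy to try the merge on a small subset of files first, as was done here.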