naobservatory / mgs-pipeline

Simon bowtie2 db #35

Closed simonleandergrimm closed 7 months ago

simonleandergrimm commented 7 months ago

Hey @jeffkaufman,

Here is the complete build_bowtie2_db.py. I have marked this PR as a draft. Some things to consider:

I will let build_bowtie2_db.py run overnight to see how far it gets, and I will report back on the result. In the meantime, feel free to review the code and the README, and assign the PR back to me once you have had time to review it.

simonleandergrimm commented 7 months ago

Successfully ran the entire script throughout the night, though the combined_genomes.fna file seems too short?

[ec2-user@ip-172-31-32-232 bowtie]$ wc -l detailed-taxids.txt
127639 detailed-taxids.txt
[ec2-user@ip-172-31-32-232 bowtie]$ wc -l ../human-viruses.tsv
28017 ../human-viruses.tsv
[ec2-user@ip-172-31-32-232 bowtie]$ grep -o '>' combined_genomes.fna | wc -l
41925
[ec2-user@ip-172-31-32-232 bowtie]$
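
For reference, the record count above can also be checked by counting header lines rather than `>` characters. `grep -o '>'` counts every `>` in the file, so it would overcount if any header description happened to contain one (whether any do here is not known); `grep -c '^>'` counts only lines that start with `>`. A minimal Python sketch of the same check follows. The function name `count_fasta_records` is hypothetical and is not part of build_bowtie2_db.py.

```python
def count_fasta_records(path):
    """Count FASTA records as the number of lines starting with '>'.

    Hypothetical helper for a sanity check; not taken from the PR.
    """
    with open(path) as fasta:
        return sum(1 for line in fasta if line.startswith(">"))


if __name__ == "__main__":
    import sys

    # Usage (assumed invocation): python count_records.py combined_genomes.fna
    print(count_fasta_records(sys.argv[1]))
```

If this number and the `grep -o '>'` count agree, no header contains an extra `>`.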

jeffkaufman commented 7 months ago

> Successfully ran the entire script throughout the night, though the combined_genomes.fna file seems too short?

That looks right to me? I ran this a while ago, and got 39445 genomes.

The number of human viruses isn't that high, and GenBank still has some filter on addition.

simonleandergrimm commented 7 months ago

OK, good to know. Will merge.