naobservatory / mgs-pipeline

build bowtie: avoid duplicates #44

Closed: jeffkaufman closed this 7 months ago

jeffkaufman commented 7 months ago

One reason this is currently very slow is that our initial list includes some nodes that are taxonomic children of other nodes, so the full expansion includes duplicates. Remove duplicates from detailed-taxids.txt before calling ncbi-genome-download.

Before this change detailed-taxids.txt was ~80% duplicates.
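
For illustration only, the deduplication step described above could look roughly like the sketch below. It is not the code from this change. It assumes detailed-taxids.txt holds one taxid per line, and the function names `dedupe_taxids` and `dedupe_file` are hypothetical.

```python
def dedupe_taxids(lines):
    """Return the unique taxids from lines, in first-seen order.

    Assumes one taxid per line; blank lines are skipped.
    """
    seen = set()
    unique = []
    for line in lines:
        taxid = line.strip()
        if not taxid or taxid in seen:
            continue
        seen.add(taxid)
        unique.append(taxid)
    return unique


def dedupe_file(path):
    """Rewrite the file at path in place with duplicates removed.

    Returns the number of unique taxids written.
    """
    with open(path) as inf:
        unique = dedupe_taxids(inf)
    with open(path, "w") as outf:
        for taxid in unique:
            outf.write(taxid + "\n")
    return len(unique)
```

Calling `dedupe_file("detailed-taxids.txt")` before invoking ncbi-genome-download would then pass each taxid along only once. Keeping first-seen order is a choice made here for readability of the output file; nothing in the description suggests the order matters.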