torognes / swarm

A robust and fast clustering method for amplicon-based studies
GNU Affero General Public License v3.0
123 stars 23 forks

Avoid producing microvariants if there is no amplicon of that length [dead end] #83

Closed frederic-mahe closed 8 years ago

frederic-mahe commented 8 years ago

The amplicon-space is made of layers of increasingly longer amplicons. When producing microvariants for an amplicon of length L, a majority of them will have a length of L+1 or L-1. However, if there is no amplicon of that length remaining in the pool, producing those microvariants is useless.

That optimization requires storing the length distribution of the amplicons remaining in the pool. When an amplicon is removed from the pool, the length distribution is updated: lengths[length] -= 1. For a given amplicon, before producing the microvariants, check that lengths[length] > 0 for substitutions, lengths[length - 1] > 0 for deletions, and lengths[length + 1] > 0 for insertions.
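To make the check concrete, here is a minimal Python sketch of the bookkeeping described above. It is an illustration only, not swarm's actual code: the function names remove_from_pool and useful_microvariant_kinds, the list-based pool, and the toy sequences are all hypothetical. Only the lengths[length] counters and the three comparisons come from the description.

```python
from collections import Counter


def remove_from_pool(seq, pool, lengths):
    """Remove an amplicon from the pool and update the length distribution."""
    pool.remove(seq)
    lengths[len(seq)] -= 1


def useful_microvariant_kinds(length, lengths):
    """List the kinds of microvariants that could still match an amplicon.

    A Counter returns 0 for lengths it has never seen, so no bounds
    checks are needed for length - 1 or length + 1.
    """
    kinds = []
    if lengths[length] > 0:        # substitutions keep the length
        kinds.append("substitutions")
    if lengths[length - 1] > 0:    # deletions shorten by one
        kinds.append("deletions")
    if lengths[length + 1] > 0:    # insertions lengthen by one
        kinds.append("insertions")
    return kinds


# Toy pool (hypothetical data): one isolated long amplicon, three short ones.
pool = ["ACGT", "ACGA", "ACGTT", "ACGTACGT"]
lengths = Counter(len(seq) for seq in pool)

# The 8-nucleotide amplicon has no neighbours of length 7, 8 or 9.
remove_from_pool("ACGTACGT", pool, lengths)
isolated = useful_microvariant_kinds(8, lengths)

# The 4-nucleotide amplicon still has neighbours of length 4 and 5.
remove_from_pool("ACGT", pool, lengths)
crowded = useful_microvariant_kinds(4, lengths)

print(isolated, crowded)
```

In this toy pool, all microvariants are skipped for the isolated amplicon, and only the deletions are skipped for the short one. The check costs three counter lookups per amplicon, so it only pays off when many length layers are empty.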

Tests on the BioMarKs V9 dataset show a very marginal gain (0.02% of microvariants avoided). Because the Earth Microbiome dataset is more unbalanced (most sequences are 100 or 150 bp long), I would expect slightly better results. Nonetheless, the gain should be less than a few percent, so there is no need to implement that optimization.

frederic-mahe commented 8 years ago

@torognes, I just wanted you to read that, so we all know it is a dead end. I am going to close the issue now.

torognes commented 8 years ago

It is an interesting idea, but as you write, the gains will probably be rather small. As you have just described, there are other possibilities (storing and hashing nucleotides as 2-bit values) that could potentially offer substantial improvements.
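For readers unfamiliar with the 2-bit idea mentioned above, a rough Python sketch of packing nucleotides into 2-bit values could look like this. The mapping A=0, C=1, G=2, T=3 and the helper names pack and unpack are assumptions made for illustration; the thread does not say which encoding or hash function swarm would use.

```python
# Assumed mapping, chosen for illustration only.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODING = "ACGT"


def pack(seq):
    """Pack a nucleotide string into an integer, 2 bits per nucleotide."""
    value = 0
    for nucleotide in seq:
        value = (value << 2) | ENCODING[nucleotide]
    return value


def unpack(value, length):
    """Recover the nucleotide string; the length must be stored separately
    because leading A's are encoded as zero bits."""
    return "".join(
        DECODING[(value >> (2 * i)) & 3] for i in reversed(range(length))
    )


packed = pack("ACGT")            # 0b00011011
restored = unpack(packed, 4)
print(packed, restored)
```

Packed sequences take a quarter of the space of one-byte-per-nucleotide strings, and a hash can be computed on whole machine words instead of single characters.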