refresh-bio / PHIST

Phage-Host Interaction Search Tool
GNU General Public License v3.0

Strange behavior with changing -k flag #5

Closed snayfach closed 1 year ago

snayfach commented 2 years ago

I know the README recommends a kmer size between 25 and 30 bp, but I was hoping to use a longer size (to avoid kmer matches for CRISPR spacers).

I tried kmers of 25, 30, 50 and 100 bp. The behavior for 25 and 30 bp was as expected -- the top hits were the same, but there were fewer kmer matches when k=30. For k=50, the top hit changed and it contained 2x as many kmer matches. For k=100, multiple hits were reported for each query genome, each with many kmer matches.

I'd suggest having the program raise an error if the user supplies a kmer value above 30 and also including this information in the help text. Ideally, it would be great to be able to use longer kmer lengths.
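To make the suggestion concrete, here is a minimal sketch of that kind of check, written in Python with `argparse`. It is not PHIST's actual code: the parser name `phist-sketch`, the function names, the lower bound `MIN_K` and the default of 25 are all assumptions for illustration. Only the upper bound of 30 comes from this thread.

```python
import argparse

# MAX_K is the limit discussed in this thread.
# MIN_K and the default below are illustrative assumptions, not PHIST's values.
MIN_K = 3
MAX_K = 30


def kmer_length(value: str) -> int:
    """argparse type-checker that rejects unsupported k-mer lengths."""
    try:
        k = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"k must be an integer, got {value!r}")
    if not MIN_K <= k <= MAX_K:
        raise argparse.ArgumentTypeError(
            f"k must be between {MIN_K} and {MAX_K}, got {k}"
        )
    return k


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line parser with a validated -k option."""
    parser = argparse.ArgumentParser(prog="phist-sketch")
    parser.add_argument(
        "-k",
        type=kmer_length,
        default=25,  # assumed default, for illustration only
        help=f"k-mer length (allowed range: {MIN_K}-{MAX_K})",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"using k = {args.k}")
```

With a check like this, an out-of-range value such as `-k 50` stops the program with a usage error instead of silently producing results, and the allowed range shows up in the `--help` output.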

agudys commented 2 years ago

Hello,

Unfortunately, the current version of PHIST does not support k larger than 30 due to its internal k-mer representation. Therefore, all the results you got for k > 30 are meaningless. We agree that this information should be stated more explicitly in the help and that the package should raise an error when an illegal k is given. We will fix this in the next release.
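As an illustration of how a fixed-width k-mer representation can produce misleading matches, the sketch below packs each base into 2 bits of a 64-bit word, which is a common technique in k-mer tools. This is an assumption made for the example, not a description of PHIST's actual layout (whose limit is 30, not 32), and `pack_kmer` is a hypothetical helper.

```python
WORD_BITS = 64                      # assumed word size, for illustration only
MASK = (1 << WORD_BITS) - 1         # emulates fixed-width integer overflow
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_kmer(kmer: str) -> int:
    """Pack a k-mer into a 64-bit word, 2 bits per base.

    Bases beyond the word's capacity (32 here) are shifted out and lost.
    """
    value = 0
    for base in kmer:
        value = ((value << 2) | CODE[base]) & MASK
    return value


# Two different 30-mers stay distinct, because 60 bits fit in the word.
short_a = "A" * 30
short_b = "C" + "A" * 29
print(pack_kmer(short_a) == pack_kmer(short_b))

# Two different 50-mers that share their last 32 bases collapse to the
# same value, because the first 18 bases no longer fit.
long_a = "A" * 50
long_b = "C" * 18 + "A" * 32
print(pack_kmer(long_a) == pack_kmer(long_b))
```

Under this assumed layout, an oversized k makes distinct k-mers compare as equal, which would inflate match counts in the way reported above for k=50 and k=100.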

As for allowing longer k-mers, this would require fundamental changes to the algorithm and data structures. We will mark this as a TODO feature, but I would not expect it to appear anytime soon. Sorry!

Regards, Adam