simonrharris / SKA

Split Kmer Analysis
MIT License

Understanding split kmers produced from fasta files #23

Closed antunderwood closed 5 years ago

antunderwood commented 5 years ago

In trying to understand the split kmers produced from fasta files, I made 3 simple (and not at all biologically relevant) fasta files:

ref.fas (a string of 160 Gs)

As expected, this produced 1 split kmer, as shown by the humanized summary.

The 2nd fasta file (sample1) had just one polymorphism (G->A) at position 90

This file produced 31 split kmers as expected, but the kmer that should have had an A as the middle base is reported with an N, I think.

The 3rd fasta file (sample2) had a different single polymorphism (G->T) at the same position, 90.

The split kmer file showed a similar result.

And when I calculated the distance between sample 1 and sample 2 using ska distance sample1.skf sample2.skf -o sample1_vs_sample2, the output showed no SNPs.

antunderwood commented 5 years ago

Thanks to a personal communication from @simonrharris, the answer to this turns out to be fairly obvious, had I thought about it 😳

Because my sequences were non-random, there were split kmers with the same flanking sequence of 15 G's but with a central base of either G or A in sample 1, and either G or T in sample 2. Since this is reference-free, there is no way to align kmers in repeated sequences, which is presumably why the middle base of that kmer was reported as ambiguous and no SNP was called.
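To make the collision concrete, here is a minimal Python sketch of the situation described above. It is not SKA's implementation: it assumes 15-base flanks on each side of the middle base, ignores reverse-complement handling, and uses a deliberately simplified SNP rule (hypothetical helper count_snps) that only counts flank pairs whose middle base is unambiguous in both samples.

```python
from collections import defaultdict

K = 15  # flank length on each side; assumed from the "15 G's" in the thread


def split_kmers(seq, k=K):
    """Map (left flank, right flank) -> set of middle bases seen between them."""
    middles = defaultdict(set)
    for i in range(k, len(seq) - k):
        left = seq[i - k:i]
        right = seq[i + 1:i + 1 + k]
        middles[(left, right)].add(seq[i])
    return middles


def with_snp(seq, pos, base):
    """Return seq with the 1-based position pos replaced by base."""
    return seq[:pos - 1] + base + seq[pos:]


def count_snps(a, b):
    """Simplified model: count shared flank pairs whose single middle base differs.

    Flank pairs with more than one middle base in either sample are skipped,
    because there is no way to tell which copy corresponds to which.
    """
    snps = 0
    for flanks in a.keys() & b.keys():
        if len(a[flanks]) == 1 and len(b[flanks]) == 1 and a[flanks] != b[flanks]:
            snps += 1
    return snps


ref = "G" * 160
sample1 = with_snp(ref, 90, "A")
sample2 = with_snp(ref, 90, "T")

all_g = ("G" * K, "G" * K)
for name, seq in [("ref", ref), ("sample1", sample1), ("sample2", sample2)]:
    kmers = split_kmers(seq)
    # Number of distinct flank pairs, and the middle bases seen for the all-G flanks
    print(name, len(kmers), sorted(kmers[all_g]))

print("SNPs:", count_snps(split_kmers(sample1), split_kmers(sample2)))
```

Under these assumptions the sketch gives 1 flank pair for ref and 31 for each sample: 30 with the variant base inside a flank, plus the all-G flank pair, whose middle base is both G and the variant. The only flank pair the two samples share is that ambiguous one, so the simplified rule finds no SNPs.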

If I make

ska fasta -o sample1 sample1.fas
ska fasta -o sample2 sample2.fas
ska distance sample1.skf sample2.skf -o sample1_vs_sample2

This produces a file, sample1_vs_sample2.distances.tsv, which correctly reports 2 SNPs.
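For contrast, the same kind of toy model can be run on non-repetitive input. The sketch below is purely illustrative and does not reproduce the files used in this thread: the 200-base random sequence, the seed, and the two substitution positions (60 and 140) are all made up, and count_snps is a hypothetical simplification, not SKA's distance calculation.

```python
import random
from collections import defaultdict

K = 15  # flank length on each side (assumption, as in the thread)


def split_kmers(seq, k=K):
    """Map (left flank, right flank) -> set of middle bases seen between them."""
    middles = defaultdict(set)
    for i in range(k, len(seq) - k):
        middles[(seq[i - k:i], seq[i + 1:i + 1 + k])].add(seq[i])
    return middles


def count_snps(a, b):
    """Count shared flank pairs with one unambiguous, differing middle base."""
    return sum(
        1
        for flanks in a.keys() & b.keys()
        if len(a[flanks]) == 1 and len(b[flanks]) == 1 and a[flanks] != b[flanks]
    )


def mutate(seq, pos):
    """Replace the 1-based position pos with the next base in ACGT order."""
    old = seq[pos - 1]
    new = "ACGT"[("ACGT".index(old) + 1) % 4]  # always differs from old
    return seq[:pos - 1] + new + seq[pos:]


rng = random.Random(0)  # fixed seed so the toy sequence is reproducible
base = "".join(rng.choice("ACGT") for _ in range(200))

# Hypothetical samples: each differs from the base sequence at one position
sample1 = mutate(base, 60)
sample2 = mutate(base, 140)

print("SNPs:", count_snps(split_kmers(sample1), split_kmers(sample2)))
```

Because the flanks around each substitution are effectively unique in a random sequence, each variant site has exactly one matching flank pair in both samples, and the toy rule counts both differences.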