Understanding split kmers produced from fasta files

simonrharris / SKA

Split Kmer Analysis

MIT License


To understand how split kmers are produced from fasta files, I made 3 simple and deliberately non-biological fasta files.

ref.fas (a string of 160 Gs)

As expected, this produced 1 split kmer, as shown by the humanized summary.

The 2nd fasta file (sample1) had just one polymorphism (G->A) at position 90

This file produced 31 split kmers, as expected, but I think the kmer that should have had an A as its middle base is reported with an N.

The 3rd fasta file (sample2) had a different single polymorphism (G->T) at the same position 90.

The same thing happened with its split kmer file.

When I calculated the distance between sample1 and sample2 using ska distance sample1.skf sample2.skf -o sample1_vs_sample2, the output showed no SNPs.

Thanks to a personal communication from @simonrharris, the answer turns out to be fairly obvious, had I thought about it 😳

Because my sequences were non-random, each sample contained split kmers with the same flanking sequence of 15 Gs on either side but different central bases: G or A in sample1, and G or T in sample2. The central base for that pair of flanks is therefore ambiguous within each sample. Since the method is reference free, there is no way to align kmers in repeated sequences, so the polymorphism cannot be placed.
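To make the collision concrete, here is a minimal Python sketch of the idea. It is a simplified model, not SKA's implementation: it assumes 15-base flanks on each side of the middle base, ignores reverse complements, and simply records every middle base seen for each pair of flanks. The function names split_kmers and with_snp are made up for this illustration.

```python
from collections import defaultdict

FLANK = 15  # assumed flank length on each side of the middle base


def split_kmers(seq, flank=FLANK):
    """Map (left flank, right flank) -> set of middle bases seen in seq.

    Simplified model: forward strand only, no canonical k-mers.
    """
    kmers = defaultdict(set)
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        left = seq[i:i + flank]
        middle = seq[i + flank]
        right = seq[i + flank + 1:i + width]
        kmers[(left, right)].add(middle)
    return dict(kmers)


def with_snp(seq, pos, base):
    """Return seq with the base at 1-based position pos replaced by base."""
    return seq[:pos - 1] + base + seq[pos:]


ref = "G" * 160
sample1 = with_snp(ref, 90, "A")
sample2 = with_snp(ref, 90, "T")

k_ref, k1, k2 = (split_kmers(s) for s in (ref, sample1, sample2))
all_g = ("G" * FLANK, "G" * FLANK)

# Number of distinct flank pairs per file
print(len(k_ref), len(k1), len(k2))          # 1 31 31
# The all-G flank pair has two possible middle bases in each sample
print(sorted(k1[all_g]), sorted(k2[all_g]))  # ['A', 'G'] ['G', 'T']
# It is also the only flank pair the two samples have in common
print(set(k1) & set(k2) == {all_g})          # True
```

In this model the other 30 split kmers of each sample carry the A or the T inside a flank, so they never match between the samples. The only shared split kmer is the all-G one, whose middle base is ambiguous in both, which is consistent with no SNPs being reported.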

If I instead make:

- a new ref of 1000 random bases, and
- 2 new sequences:
  - sample1, which has one polymorphism compared to the ref plus an extra 31 random bases
  - sample2, which has a different polymorphism at the same position compared to the ref, plus the same extra 31 random bases apart from one polymorphism

and run:
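For anyone wanting to reproduce this, test files of this shape could be generated with something like the Python sketch below. The details are my assumptions, not taken from the original files: the seed, the polymorphism at position 500, the second polymorphism in the middle (position 16) of the extra 31 bases, and the helper names are all hypothetical.

```python
import random

BASES = "ACGT"
SNP_POS = 500    # assumed 1-based position of the polymorphism in the ref
EXTRA_POS = 16   # assumed 1-based position within the extra 31 bases


def random_seq(rng, n):
    """Return n random bases."""
    return "".join(rng.choice(BASES) for _ in range(n))


def mutate(rng, seq, pos, exclude=""):
    """Replace 1-based position pos with a base that differs from the
    current one and from any base listed in exclude."""
    choices = [b for b in BASES if b != seq[pos - 1] and b not in exclude]
    return seq[:pos - 1] + rng.choice(choices) + seq[pos:]


def write_fasta(path, name, seq, width=60):
    """Write a single-record fasta file with wrapped sequence lines."""
    with open(path, "w") as fh:
        fh.write(f">{name}\n")
        for i in range(0, len(seq), width):
            fh.write(seq[i:i + width] + "\n")


rng = random.Random(23)  # fixed seed so the files are reproducible
ref = random_seq(rng, 1000)
extra = random_seq(rng, 31)

core1 = mutate(rng, ref, SNP_POS)
# sample2 gets a base that differs from both the ref and sample1
core2 = mutate(rng, ref, SNP_POS, exclude=core1[SNP_POS - 1])
sample1 = core1 + extra
sample2 = core2 + mutate(rng, extra, EXTRA_POS)

for name, seq in (("ref", ref), ("sample1", sample1), ("sample2", sample2)):
    write_fasta(f"{name}.fas", name, seq)

# Sanity check: the two samples differ at exactly two positions
differences = sum(a != b for a, b in zip(sample1, sample2))
print(differences)  # 2
```

Placing the second polymorphism in the middle of the extra 31 bases leaves 15 bases of flank on each side of it, so in principle a split kmer can cover it.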

ska fasta -o sample1 sample1.fas
ska fasta -o sample2 sample2.fas
ska distance sample1.skf sample2.skf -o sample1_vs_sample2

This produces a file, sample1_vs_sample2.distances.tsv, that correctly reports 2 SNPs between the two samples.

Understanding split kmers produced from fasta files #23