refresh-bio / KMC

Fast and frugal disk based k-mer counter
266 stars 73 forks source link

kmer presence issue #184

Closed chklopp closed 2 years ago

chklopp commented 2 years ago

I'm trying to find kmer present twice in a genome assembly fasta file (using version kmc 3.1.1)

kmc -v -k23 -m60 -t4 -ci2 -cx2 -fm genome.fa double tmp kmc_dump double double.dump

This seems OK

head double.dump AAAAAAAAATGAATATTTATTTG 2 AAAAAAAATGAATATTTATTTGG 2 AAAAAAACACTTCGTCTTTGCGA 2 AAAAAAATGACGAAAACACTTCG 2 AAAAAACACTTCGTCTTTGCGAG 2 AAAAAACTGCGCATACTCCAACT 2 AAAAAAGTGATTGCGCAGTTCTT 2 AAAAAATGACGAAAACACTTCGG 2 AAAAAATTTACTAAAATATTCAT 2 AAAAACACTTCGTCTTTGCGAGT 2

NB. all counts are = 2

But when I check for the kmer presence (and reverse presence) in the genome (using grep on a tabulated genome file = one sequence per line)

AAAAAAAAATGAATATTTATTTG 1 CAAATAAATATTCATTTTTTTTT 1 AAAAAAAATGAATATTTATTTGG 1 CCAAATAAATATTCATTTTTTTT 1 AAAAAAACACTTCGTCTTTGCGA 1 TCGCAAAGACGAAGTGTTTTTTT 1 AAAAAAATGACGAAAACACTTCG 1 CGAAGTGTTTTCGTCATTTTTTT 1 AAAAAACACTTCGTCTTTGCGAG 1 CTCGCAAAGACGAAGTGTTTTTT 1 AAAAAACTGCGCATACTCCAACT 0 AGTTGGAGTATGCGCAGTTTTTT 2 AAAAAAGTGATTGCGCAGTTCTT 1 AAGAACTGCGCAATCACTTTTTT 0 AAAAAATGACGAAAACACTTCGG 1 CCGAAGTGTTTTCGTCATTTTTT 1 AAAAAATTTACTAAAATATTCAT 1 ATGAATATTTTAGTAAATTTTTT 1 AAAAACACTTCGTCTTTGCGAGT 1 ACTCGCAAAGACGAAGTGTTTTT 1

I have some inconsistencies such as : AAAAAAGTGATTGCGCAGTTCTT 1 AAGAACTGCGCAATCACTTTTTT 0

For the first 101 kmers, I get the following counts 1 0 20 1 80 2

Why are some kmers not a the awaited count?

marekkokot commented 2 years ago

Hi,

could you send me your input file (genome.fa)? I will try to reproduce your results and find the reason. Could you also show how exactly you use the grep command?

chklopp commented 2 years ago

Hi,

Unfortunately I can not share this genome but I have made the same analysis on a public genome : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz

Here is the link to the kmc dump file of the doubles http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump

and the tab version of the genome http://genoweb.toulouse.inra.fr/~klopp/tmp/genome.fa.upper.tab

Here is my grep forward/reverse command for i in cut -f 1 b2dec665f3e04333b064.dump | head -100; do BB=grep -c $i genome.fa.upper.tab; AA=echo $i | awk 'BEGIN{ a["A"]="T";a["C"]="G";a["G"]="C";a["T"]="A";}{split($0,b,""); for (i=1;i<=length($0);i++){c=a[b[i]]""c;}}END{print c}'; CC=grep -c $AA genome.fa.upper.tab; echo $i $BB $AA $CC; done AAAAAAAGTTTTGGTAAAAACTC 2 GAGTTTTTACCAAAACTTTTTTT 0 AAAAAAGTTTTGGTAAAAACTCG 2 CGAGTTTTTACCAAAACTTTTTT 0 AAAAACGAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTCGTTTTT 1 AAAACGAAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTTCGTTTT 1 AACGAAAAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTTTTCGTT 1 AACGAAAAAAAAAAAAAAAATAA 1 TTATTTTTTTTTTTTTTTTCGTT 0 AATCTTTTCGACTCTGTGACCTC 0 GAGGTCACAGAGTCGAAAAGATT 2 AATTTTGTTTTTCGCCCACATAA 1 TTATGTGGGCGAAAAACAAAATT 1 ACAGAGTCGAAAAGATTCCAGAA 2 TTCTGGAATCTTTTCGACTCTGT 0 ACAGATTTATCTGCCAATGTTGC 1 GCAACATTGGCAGATAAATCTGT 1 ACATTATGTGGGCGAAAAACAAA 1 TTTGTTTTTCGCCCACATAATGT 1 ACATTCTGTCCCACATACTAAAT 1 ATTTAGTATGTGGGACAGAATGT 0 ACCGTTAGAGTCGACATCGAAAG 2 CTTTCGATGTCGACTCTAACGGT 0 ACGCCAAAACAAGAACAAAAGGA 1 TCCTTTTGTTCTTGTTTTGGCGT 1 ACGCTTCTTTCTTGCAAAGACAA 0 TTGTCTTTGCAAGAAAGAAGCGT 2 ACTCTTGCTTTTGCTTCTTCCTC 1 GAGGAAGAAGCAAAAGCAAGAGT 1 ACTTAACAACCAAGGAACCAACA 0 TGTTGGTTCCTTGGTTGTTAAGT 2

The line in bold corresponds to one problematic case.

marekkokot commented 2 years ago

Ok, I think I have the reason. grep -c returns the number of lines containing given string. k-mer AACGAAAAAAAAAAAAAAAATAA occurs twice in chromosome >NC_001144.5 Saccharomyces cerevisiae S288C chromosome XII, complete sequence. In your upper.tab file it is a single line so grep gives you 1. It seems you may use:

#/bin/bash
for i in `cut -f 1 b2dec665f3e04333b064.dump | head -100`; 
do 
    BB=`grep -o $i genome.fa.upper.tab | wc -l`;
    AA=`echo $i | awk 'BEGIN{ a["A"]="T";a["C"]="G";a["G"]="C";a["T"]="A";}{split($0,b,""); for (i=1;i<=length($0);i++){c=a[b[i]]""c;}}END{print c}'`; 
    CC=`grep -o $AA genome.fa.upper.tab | wc -l`; 
    echo $i $BB $AA $CC; 
done

I haven't checked other inconsistencies. If something is still unclear let me know.

If you need to count unique k-mers per sequence (read/chromosome) it is not currently natively supported (it was, but it seems there are no use cases, so we dropped it for simplicity). Currently adding it is not easy (but probably doable, yet requiring some work). I may suggest some workarounds, but probably not very efficient (it may not be relevant to you depending on the time required to perform other steps of your pipeline). If you need the workaround let me know. If everything is clear now, please close the issue.

chklopp commented 2 years ago

It is very clear and thank you for taking the time to debug my script.