Closed chklopp closed 2 years ago
Hi,
could you send me your input file (genome.fa)? I will try to reproduce your results and find the reason. Could you also show how exactly you use the grep command?
Hi,
Unfortunately I can not share this genome but I have made the same analysis on a public genome : https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_genomic.fna.gz
Here is the link to the kmc dump file of the doubles http://genoweb.toulouse.inra.fr/~klopp/tmp/b2dec665f3e04333b064.dump
and the tab version of the genome http://genoweb.toulouse.inra.fr/~klopp/tmp/genome.fa.upper.tab
Here is my grep forward/reverse command
for i in cut -f 1 b2dec665f3e04333b064.dump | head -100
; do BB=grep -c $i genome.fa.upper.tab
; AA=echo $i | awk 'BEGIN{ a["A"]="T";a["C"]="G";a["G"]="C";a["T"]="A";}{split($0,b,""); for (i=1;i<=length($0);i++){c=a[b[i]]""c;}}END{print c}'
; CC=grep -c $AA genome.fa.upper.tab
; echo $i $BB $AA $CC; done
AAAAAAAGTTTTGGTAAAAACTC 2 GAGTTTTTACCAAAACTTTTTTT 0
AAAAAAGTTTTGGTAAAAACTCG 2 CGAGTTTTTACCAAAACTTTTTT 0
AAAAACGAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTCGTTTTT 1
AAAACGAAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTTCGTTTT 1
AACGAAAAAAAAAAAAAAAAAAA 1 TTTTTTTTTTTTTTTTTTTCGTT 1
AACGAAAAAAAAAAAAAAAATAA 1 TTATTTTTTTTTTTTTTTTCGTT 0
AATCTTTTCGACTCTGTGACCTC 0 GAGGTCACAGAGTCGAAAAGATT 2
AATTTTGTTTTTCGCCCACATAA 1 TTATGTGGGCGAAAAACAAAATT 1
ACAGAGTCGAAAAGATTCCAGAA 2 TTCTGGAATCTTTTCGACTCTGT 0
ACAGATTTATCTGCCAATGTTGC 1 GCAACATTGGCAGATAAATCTGT 1
ACATTATGTGGGCGAAAAACAAA 1 TTTGTTTTTCGCCCACATAATGT 1
ACATTCTGTCCCACATACTAAAT 1 ATTTAGTATGTGGGACAGAATGT 0
ACCGTTAGAGTCGACATCGAAAG 2 CTTTCGATGTCGACTCTAACGGT 0
ACGCCAAAACAAGAACAAAAGGA 1 TCCTTTTGTTCTTGTTTTGGCGT 1
ACGCTTCTTTCTTGCAAAGACAA 0 TTGTCTTTGCAAGAAAGAAGCGT 2
ACTCTTGCTTTTGCTTCTTCCTC 1 GAGGAAGAAGCAAAAGCAAGAGT 1
ACTTAACAACCAAGGAACCAACA 0 TGTTGGTTCCTTGGTTGTTAAGT 2
The line in bold corresponds to one problematic case.
Ok, I think I have the reason. grep -c returns the number of lines containing given string. k-mer AACGAAAAAAAAAAAAAAAATAA
occurs twice in chromosome >NC_001144.5 Saccharomyces cerevisiae S288C chromosome XII, complete sequence
. In your upper.tab file it is a single line so grep gives you 1.
It seems you may use:
#/bin/bash
for i in `cut -f 1 b2dec665f3e04333b064.dump | head -100`;
do
BB=`grep -o $i genome.fa.upper.tab | wc -l`;
AA=`echo $i | awk 'BEGIN{ a["A"]="T";a["C"]="G";a["G"]="C";a["T"]="A";}{split($0,b,""); for (i=1;i<=length($0);i++){c=a[b[i]]""c;}}END{print c}'`;
CC=`grep -o $AA genome.fa.upper.tab | wc -l`;
echo $i $BB $AA $CC;
done
I haven't checked other inconsistencies. If something is still unclear let me know.
If you need to count unique k-mers per sequence (read/chromosome) it is not currently natively supported (it was, but it seems there are no use cases, so we dropped it for simplicity). Currently adding it is not easy (but probably doable, yet requiring some work). I may suggest some workarounds, but probably not very efficient (it may not be relevant to you depending on the time required to perform other steps of your pipeline). If you need the workaround let me know. If everything is clear now, please close the issue.
It is very clear and thank you for taking the time to debug my script.
I'm trying to find kmer present twice in a genome assembly fasta file (using version kmc 3.1.1)
kmc -v -k23 -m60 -t4 -ci2 -cx2 -fm genome.fa double tmp kmc_dump double double.dump
This seems OK
head double.dump AAAAAAAAATGAATATTTATTTG 2 AAAAAAAATGAATATTTATTTGG 2 AAAAAAACACTTCGTCTTTGCGA 2 AAAAAAATGACGAAAACACTTCG 2 AAAAAACACTTCGTCTTTGCGAG 2 AAAAAACTGCGCATACTCCAACT 2 AAAAAAGTGATTGCGCAGTTCTT 2 AAAAAATGACGAAAACACTTCGG 2 AAAAAATTTACTAAAATATTCAT 2 AAAAACACTTCGTCTTTGCGAGT 2
NB. all counts are = 2
But when I check for the kmer presence (and reverse presence) in the genome (using grep on a tabulated genome file = one sequence per line)
AAAAAAAAATGAATATTTATTTG 1 CAAATAAATATTCATTTTTTTTT 1 AAAAAAAATGAATATTTATTTGG 1 CCAAATAAATATTCATTTTTTTT 1 AAAAAAACACTTCGTCTTTGCGA 1 TCGCAAAGACGAAGTGTTTTTTT 1 AAAAAAATGACGAAAACACTTCG 1 CGAAGTGTTTTCGTCATTTTTTT 1 AAAAAACACTTCGTCTTTGCGAG 1 CTCGCAAAGACGAAGTGTTTTTT 1 AAAAAACTGCGCATACTCCAACT 0 AGTTGGAGTATGCGCAGTTTTTT 2 AAAAAAGTGATTGCGCAGTTCTT 1 AAGAACTGCGCAATCACTTTTTT 0 AAAAAATGACGAAAACACTTCGG 1 CCGAAGTGTTTTCGTCATTTTTT 1 AAAAAATTTACTAAAATATTCAT 1 ATGAATATTTTAGTAAATTTTTT 1 AAAAACACTTCGTCTTTGCGAGT 1 ACTCGCAAAGACGAAGTGTTTTT 1
I have some inconsistencies such as : AAAAAAGTGATTGCGCAGTTCTT 1 AAGAACTGCGCAATCACTTTTTT 0
For the first 101 kmers, I get the following counts 1 0 20 1 80 2
Why are some kmers not a the awaited count?