refresh-bio / KMC

Fast and frugal disk based k-mer counter

Can't count kmer on fastq file #137

Open chequochuu opened 5 years ago

chequochuu commented 5 years ago

I am using the latest KMC code, but I can't count k-mers in a FASTQ file. It works on FASTA.

$cat r1_test.fq 
@0|Chromosome|4051100|4051286/2 BX:Z:CGACACGGTTTGGGCC
AAACCCAACCAC
+
FFFFFFFFFFFF

$kmc -fq -m5 -ci1 -k3 r1_test.fq res ./tmp/
Stage 1: 100%
1st stage: 0.000393s
2nd stage: 6.4e-05s
Total    : 0.000457s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            0
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            0
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            0

marekkokot commented 5 years ago

Hi,

I am not able to reproduce this bug.

By "latest KMC code", do you mean that you compiled commit 85ad76956d890aa24fc8525eee5653078ed86ace?

Could you rerun it with the -v switch and send me your output?

Could you try to rerun it with -t1 and check if it still does not work?

chequochuu commented 5 years ago

Yes, I used that commit. I still get the error.


Info: Small k optimization on!

******* configuration for small k mode: *******
No. of input files           : 1
Output file name             : res
Input format                 : FASTQ

k-mer length                 : 3
Max. k-mer length            : 256
Min. count threshold         : 1
Max. count threshold         : 1000000000
Max. counter value           : 255
Both strands                 : true
Input buffer size            : 33554432

No. of readers               : 1
No. of splitters             : 1

Max. mem. size               :  5000MB

Max. mem. for PMM (FASTQ)    :  3294MB
Part. mem. for PMM (FASTQ)   :    33MB
Max. mem. for PMM (reads)    :     1MB
Part. mem. for PMM (reads)   :     0MB
Max. mem. for PMM (b. reader):   402MB
Part. mem. for PMM (b. reader):   134MB

Stage 1: 100%
1st stage: 0.000247s
2nd stage: 6.3e-05s
Total    : 0.00031s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            0
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            0
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            0

chequochuu commented 5 years ago

It seems that it doesn't work when a barcode is included in the read name. When I remove the barcode:

@0|Chromosome|4051100|4051286/2
AAACCCAACCAC
+
FFFFFFFFFFFF

It works like a charm!

marekkokot commented 5 years ago

Hmmm, it is still weird that it worked on my machine. Maybe the input file I prepared differs from yours. Could you send me your file r1_test.fq?

chequochuu commented 5 years ago

This is my entire r1_test.fq:

@0|Chromosome|4051100|4051286/2 BX:Z:CGACACGGTTTGGGCC
AAACCCAACCAC
+
FFFFFFFFFFFF

marekkokot commented 5 years ago

Hi, I meant that you should send me the file itself, not its content, because GitHub may remove something when you copy and paste. It seems unlikely, but currently I cannot imagine another reason why it works on my machine.

You may also copy what you have pasted here to a new file and check if KMC still produces wrong results on your machine.

chequochuu commented 5 years ago

I have found out that the character between the ID and the barcode is \t instead of a space. Sorry, my bad.
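
Since a tab and a space look identical once pasted into a comment, a quick check of the header lines can confirm which one is present. Below is a minimal Python sketch, assuming plain 4-line FASTQ records (no wrapped sequences); the helper name `find_tabbed_headers` is made up for illustration and is not part of KMC.

```python
def find_tabbed_headers(lines):
    """Return (record_index, header) pairs for FASTQ headers containing a tab.

    Assumption: every record occupies exactly 4 lines, so headers sit on
    lines 0, 4, 8, ... of the input.
    """
    hits = []
    for i, line in enumerate(lines):
        if i % 4 == 0 and "\t" in line:
            hits.append((i // 4, line.rstrip("\n")))
    return hits


# The record from this issue, with the tab written out explicitly.
record = [
    "@0|Chromosome|4051100|4051286/2\tBX:Z:CGACACGGTTTGGGCC\n",
    "AAACCCAACCAC\n",
    "+\n",
    "FFFFFFFFFFFF\n",
]

for index, header in find_tabbed_headers(record):
    # repr() renders the otherwise invisible tab as \t
    print(index, repr(header))
```

Passing an open file handle instead of the list works the same way, since the function only iterates over lines.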

marekkokot commented 5 years ago

Ok, thanks for the info. It seems it is the same bug as #42, so I will keep it open to remember to add '\t' support. Anyway, thanks for reporting that issue and thanks for using KMC.

taprs commented 10 months ago

Bump! I ran into the same issue as of today. Would be cool to have it fixed, especially given that many linked-read pipelines produce tabbed headers by default.

richardstoeckl commented 4 months ago

The new versions of Nanopore's Dorado and related tools also produce tabbed headers in their fastq files, so I would also appreciate a fix :)

esdpoort commented 1 month ago

I also ran into this issue with FASTQ files generated by Dorado v0.7.0, which have tabs in the headers. For now I used seqkit replace to change tabs into spaces as a workaround, but it would be nice if KMC could handle tabbed FASTQ headers.
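
For anyone without seqkit at hand, the same tab-to-space workaround can be sketched in a few lines of Python. This is only an illustration under the assumption of plain 4-line FASTQ records; the function name `tabs_to_spaces_in_headers` is hypothetical, and it rewrites header lines only, leaving sequence and quality lines untouched.

```python
def tabs_to_spaces_in_headers(lines):
    """Yield FASTQ lines with tabs in header lines replaced by spaces.

    Assumption: every record occupies exactly 4 lines, so headers sit on
    lines 0, 4, 8, ... of the input. All other lines pass through unchanged.
    """
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.replace("\t", " ")
        else:
            yield line


# Small in-memory demo using the record from this issue.
tabbed = [
    "@0|Chromosome|4051100|4051286/2\tBX:Z:CGACACGGTTTGGGCC\n",
    "AAACCCAACCAC\n",
    "+\n",
    "FFFFFFFFFFFF\n",
]
fixed = list(tabs_to_spaces_in_headers(tabbed))
print("".join(fixed), end="")
```

To convert a real file, the generator can be fed an open input handle and its output written with `writelines` on an output handle; the converted file is then what gets passed to kmc.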