refresh-bio / KMC

Fast and frugal disk based k-mer counter

paired-end sequencing fastq file #133

Closed xuyiran0609 closed 2 years ago

xuyiran0609 commented 4 years ago

Hi, how should I use paired-end sequencing FASTQ files? Should I put the paths of the two FASTQ files into a list file and use the @input_file_names option? And if I have many paired-end FASTQ files from different samples, should I put the paths of all samples' FASTQ files into one list file, or do I need to create a list file for every sample?

marekkokot commented 4 years ago

Hi, it depends on what you need. In general, KMC is not aware of paired-end files; it just counts each k-mer in the input files. Suppose you have paired-end files for two samples, A and B: A_1.fastq, A_2.fastq, B_1.fastq, B_2.fastq. If you put the paths to all of these files in one list file, KMC will create a single k-mer database containing the k-mers found across all input files combined. The database does not record the source of a k-mer; it stores just the k-mer and its count. If you need separate databases for sample A and sample B, create two list files, one per sample, and run KMC twice. As a result you will get two KMC databases, one for each sample.
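The per-sample approach could be scripted roughly as follows. This is only a sketch: the helper names (`write_list_file`, `kmc_command`), the `.lst` extension, the database names, the `kmc_tmp` working directory and the choice of `k=25` are assumptions made for illustration. It only builds the command lines in the form `kmc [options] @<list file> <output database> <working directory>` and does not run KMC.

```python
import tempfile
from pathlib import Path


def write_list_file(sample, fastq_paths, out_dir):
    """Write one FASTQ path per line to <out_dir>/<sample>.lst and return that path."""
    list_path = Path(out_dir) / f"{sample}.lst"
    list_path.write_text("".join(f"{p}\n" for p in fastq_paths))
    return list_path


def kmc_command(list_path, db_name, tmp_dir, k=25):
    """Build (but do not run) a KMC invocation for one list file."""
    # Form: kmc [options] @<list file> <output database> <working directory>
    # The working directory is assumed to exist before KMC is started.
    return ["kmc", f"-k{k}", f"@{list_path}", db_name, tmp_dir]


# Hypothetical samples, using the file names from the explanation above.
samples = {
    "A": ["A_1.fastq", "A_2.fastq"],
    "B": ["B_1.fastq", "B_2.fastq"],
}

# A temporary directory keeps the sketch from writing into the current directory.
list_dir = tempfile.mkdtemp()

commands = []
for sample, fastqs in samples.items():
    lst = write_list_file(sample, fastqs, list_dir)
    commands.append(kmc_command(lst, f"{sample}_kmers", "kmc_tmp"))

for cmd in commands:
    print(" ".join(cmd))
```

Each printed line is one KMC run, so both mates of a sample end up in the same database while the two samples stay separate. Passing a command to `subprocess.run(cmd, check=True)` would execute it, provided `kmc` is on the PATH.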

Does it answer your question?

SC-Duan commented 4 years ago

Hi,

Does KMC count both strands, like jellyfish '-C'? I want to extract the reads that a k-mer sequence comes from. Should I take the reverse complement into account, or can I just grep the k-mer sequence against my PE sequencing FASTQ files?

Thank you!

marekkokot commented 4 years ago

Hi, in the default mode, KMC counts both strands, i.e. it chooses canonical k-mers. If you need k-mers exactly as they appear in the reads, you should use the -b switch. Let me know if it helps.
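To illustrate what this means for finding the source reads: a canonical k-mer may appear in a read either as written or as its reverse complement, so a plain grep for one orientation can miss reads. The sketch below assumes the common definition of "canonical" as the lexicographically smaller of a k-mer and its reverse complement; the function names and the example reads are made up for illustration and are not part of KMC.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def reads_containing(kmer, reads):
    """Return the reads that contain the k-mer in either orientation."""
    kmer = kmer.upper()
    rc = reverse_complement(kmer)
    return [r for r in reads if kmer in r.upper() or rc in r.upper()]


# Made-up reads: the second holds the reverse complement, the third the k-mer itself.
reads = ["ACGTTGCA", "TTTTGGGG", "CCCCAAAA"]
hits = reads_containing("CCCAAA", reads)
print(hits)
```

With counts produced in the default (canonical) mode, searching for both orientations like this is the safe choice; with -b, the k-mers are reported as they occur in the reads, so a direct search for the reported sequence matches the strand it was counted on.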

SC-Duan commented 4 years ago

Yes, it is helpful, thank you!