refresh-bio / KMC

Fast and frugal disk based k-mer counter

kmc_tools filter does not accept large FASTA input #221

Open mscharmann opened 10 months ago

mscharmann commented 10 months ago

Hello, and first of all, thank you for KMC and kmc_tools, which I use frequently. I am trying to retrieve contigs from a genome assembly that contain k-mers from a database, using kmc_tools filter (ver. 3.2.1, 2022-01-04). The input to kmc_tools filter is therefore in FASTA format. The file contains multiple FASTA records (hundreds to thousands), and each sequence is on a single line, not "wrapped" across multiple lines. Some sequences are longer than 10 or even 100 megabases, and the entire FASTA file is >1 Gb in size. Neither the input file parameter -fa nor the undocumented -fm behaves as the help message suggests; I always get:

"Error: Wrong input file!"

Edit: this seems to be specific to very long sequences, in both FASTA and FASTQ format; the command succeeds when the sequences are only tens of kb long. Converting my genome contigs into fake FASTQ format does not help.
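To narrow down which records are long enough to matter, the lengths in the FASTA file can be listed directly. The following is only a minimal sketch, not part of KMC: the function names `fasta_lengths` and `long_records` and the threshold value are hypothetical, and the length at which kmc_tools starts to fail is not known from this report.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def fasta_lengths(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (record_name, sequence_length) for each FASTA record.

    Works for single-line and wrapped records alike, and never holds
    more than one line of the file in memory at a time.
    """
    name = None
    length = 0
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line.strip())
    if name is not None:
        yield name, length


def long_records(handle: TextIO, threshold: int) -> List[Tuple[str, int]]:
    """Return the records whose sequence is longer than `threshold` bases."""
    return [(n, size) for n, size in fasta_lengths(handle) if size > threshold]


# Small in-memory demonstration; for a real file use open("assembly.fa").
demo = io.StringIO(">contig1 example\nACGT\n>contig2\nACGTACGT\n")
print(long_records(demo, threshold=5))
```

Running this with a few different thresholds, and testing kmc_tools filter on the corresponding subsets, is one way to bracket the sequence length at which the error first appears.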

Many thanks and best regards, Mathias

marekkokot commented 10 months ago

Hi, thank you for using KMC and for reporting this issue. I guess something is wrong with the handling of long sequences in kmc_tools. I will try to take a look. It would be really helpful if you could share some of the input files that cause this.
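If the original assembly cannot be shared, a synthetic file with the same layout (long sequences, each on a single line) may serve as a stand-in. The sketch below is an assumption-laden illustration: the function name `write_synthetic_fasta`, the record names, and the chosen lengths are hypothetical, and it is not confirmed that random sequence triggers the same "Wrong input file!" error.

```python
import random
from typing import Sequence


def write_synthetic_fasta(path: str, lengths: Sequence[int], seed: int = 0) -> None:
    """Write one random single-line FASTA record per entry in `lengths`."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w") as out:
        for i, n in enumerate(lengths):
            out.write(f">synthetic_{i}\n")
            remaining = n
            # Emit the sequence in pieces to limit memory use; no newline
            # is written between pieces, so the record stays on one line.
            while remaining > 0:
                chunk = min(remaining, 1_000_000)
                out.write("".join(rng.choices("ACGT", k=chunk)))
                remaining -= chunk
            out.write("\n")


if __name__ == "__main__":
    # Hypothetical sizes: one short record and one of 20 megabases.
    write_synthetic_fasta("synthetic_long.fa", [50_000, 20_000_000])
```

If the synthetic file reproduces the error, it can be attached to the issue in place of the real data; if it does not, the problem may depend on something other than sequence length alone.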