refresh-bio / KMC

Fast and frugal disk based k-mer counter

how to run KMC with assembly genome (fasta) #176

Open Aannaw opened 2 years ago

Aannaw commented 2 years ago

Hello, I am confused about the command for counting k-mers with kmc on an assembled genome (FASTA). I could not find an example. My command is kmc -k21 -ci0 -t40 -m20 -fa a.fasta ./tmp. No error is reported, but the program stops. Looking forward to your reply. Thanks very much.

marekkokot commented 2 years ago

Hi,

There is also the -fm switch for multi-FASTA (FASTA where a sequence may span multiple lines). Let me know if it helps.

Aannaw commented 2 years ago

I only have one assembled genome. I want to assess my assembly after running purge_dups, and also to compare the k-mer counts of the assembly with those of the Illumina short reads. I ran -fm with only one FASTA file, and it did not seem to help.

marekkokot commented 2 years ago

I don't know the purge_dups tool. You may count k-mers in multiple files. Assume you have a bunch of multi-FASTA files. Create a file files.txt where each line stores the path to one of the multi-FASTA files. For example:

file1.fa
file2.fa

You may run kmc as follows:

kmc -k21 -ci1 -t40 -fm @files.txt 21mers .

Does it help?
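A minimal sketch of the list-file approach described above. The names file1.fa and file2.fa are placeholders, and the guard is an addition for illustration: it only prints the command when kmc or the inputs are not available.

```shell
#!/bin/sh
# file1.fa and file2.fa are placeholder names for your own multi-FASTA files.
printf '%s\n' file1.fa file2.fa > files.txt

# Options first, then: @list of inputs, output name, working directory.
cmd="kmc -k21 -ci1 -t40 -fm @files.txt 21mers ."

if command -v kmc >/dev/null 2>&1 && [ -f file1.fa ] && [ -f file2.fa ]; then
    $cmd
else
    echo "skipping, would run: $cmd"
fi
```

Here 21mers is the output name and . (the current directory) is the working directory.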

Aannaw commented 2 years ago

I created a file a.txt that lists only one FASTA file, a.fasta. Then I ran:

kmc -k21 -ci1 -t40 -fm @a.txt tmp

The standard output is:

K-Mer Counter (KMC) ver. 3.1.0 (2018-05-10)
Usage:
kmc [options] <input_file_name> <output_file_name> <working_directory>
kmc [options] <@input_file_names> <output_file_name> <working_directory>
Parameters:
input_file_list - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
-v - verbose mode (shows all parameter settings); default: false
-k - k-mer length (k from 1 to 256; default: 25)
-m - max amount of RAM in GB (from 1 to 1024); default: 12
-d - trimmed-off bases; default: 0
-sm - use strict memory mode (memory limit from -m switch will not be exceeded)
-p - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
-f<a/q/m/bam> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam); default: FASTQ
-ci - exclude k-mers occurring less than <value> times (default: 2)
-cs - maximal value of a counter (default: 255)
-cx - exclude k-mers occurring more of than <value> times (default: 1e9)
-b - turn off transformation of k-mers into canonical form
-r - turn on RAM-only mode
-n - number of bins
-t - total number of threads (default: no. of CPU cores)
-sf - number of FASTQ reading threads
-sp - number of splitting threads
-sr - number of threads for 2nd stage
-j - file name with execution summary in JSON format
-w - without output
Example:
kmc -k27 -m24 files.lst NA.res /data/kmc_tmp_dir/

No file is created and no error information is found.

marekkokot commented 2 years ago

You got this message:

Usage:
kmc [options] <input_file_name> <output_file_name> <working_directory>
kmc [options] <@input_file_names> <output_file_name> <working_directory>

You are missing the output_file_name in your command line. Use:

kmc -k21 -ci1 -t40 -fm @a.txt output tmp

Aannaw commented 2 years ago

It works! Thanks very much. Can I ask another question? For Illumina paired-end short reads (a.1.fq, a.2.fq), should I count k-mers by creating a file a.fq.txt that lists a.1.fq and a.2.fq, and then run "kmc -k21 -ci1 -t40 -fq @a.fq.txt out tmp"? Does it output the k-mers common to the two paired read files?

marekkokot commented 2 years ago

It will count each k-mer present in at least one of the input files. For sequencing reads, one should probably set a reasonable cutoff (-ci) to remove erroneous k-mers.
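A minimal sketch for the paired-read case, assuming the two read files from the question (a.1.fq, a.2.fq). The cutoff -ci2 is only an illustrative choice (it matches the default shown in the usage text); a suitable value depends on the sequencing depth. The guard is an addition so that the script prints the command when kmc or the inputs are missing.

```shell
#!/bin/sh
# One read file per line; k-mers from both files are counted together.
printf '%s\n' a.1.fq a.2.fq > a.fq.txt
mkdir -p tmp   # create the working directory up front

# -ci2 excludes k-mers that occur fewer than 2 times (illustrative cutoff).
cmd="kmc -k21 -ci2 -t40 -fq @a.fq.txt out tmp"

if command -v kmc >/dev/null 2>&1 && [ -f a.1.fq ] && [ -f a.2.fq ]; then
    $cmd
else
    echo "skipping, would run: $cmd"
fi
```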

Aannaw commented 2 years ago

That is very helpful! Thanks very much.

marekkokot commented 2 years ago

No problem. I'm closing this issue. You may reopen if needed.

jermp commented 2 years ago

Hi @marekkokot, I have the very same issue. No matter what combination of parameters I use, I always get a segfault. For example:

./kmc -v -fm -k31 -ci0 -m2 -t1 -sm ecoli1.fasta ecoli1.kmc kmc_tmp_dir

Why?

marekkokot commented 2 years ago

Hi,

I don't think it is the very same issue. It looks much worse. Do you use kmc downloaded from the release page, or maybe from bioconda or maybe you have compiled it on your own? Let me know. Also, could you please send me your input file, i.e. ecoli1.fasta ?

jermp commented 2 years ago

Hi, I cloned the repo from here (GitHub) and then compiled it on my machine. Compilation works fine. Here is the file attached (it is a tiny file).

ecoli1.fasta.gz

jermp commented 2 years ago

These are my commands:

./kmc -v -fm -k31 -ci0 -t1 ecoli1.fasta ecoli1.kmc kmc_tmp_dir
./kmc -v -fm -k31 -ci0 -t1 @list.txt ecoli1.kmc kmc_tmp_dir/

where currently list.txt contains the filepath of just that ecoli1.fasta file.

marekkokot commented 2 years ago

It works on my machine. What is your operating system and compiler? And maybe what is your hardware? Just to be sure, do you have kmc_tmp_dir created?

jermp commented 2 years ago

I'm running gcc on Ubuntu: gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2). I've also tried the release commit (b7de846829f7d8cfd18a3d1285deba6ee8ceffc2) but nothing changes. Of course, I have the tmp directory created.

marekkokot commented 2 years ago

Ok, this is weird :( Could you please try the precompiled release? You may also try removing the -static flag from the makefile, as well as the -Wl,--whole-archive and -Wl,--no-whole-archive flags.

jermp commented 2 years ago

I tried another machine of mine (Ubuntu again with gcc) and actually it worked. Very strange indeed. Everything else works correctly on the previous machine.

marekkokot commented 2 years ago

It may be hard for me to track the cause when I am not able to reproduce the error. If you have some time maybe try to run kmc under gdb (some changes in makefile may be needed) to see where it crashes. Maybe, for some strange reason, kmc cannot allocate memory? How much memory does your machine have?
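One possible way to follow the gdb suggestion, as a sketch only. It assumes ./kmc was rebuilt with debug symbols (for example by adding -g to the compiler flags in the makefile, which is an assumption about the build) and reuses the arguments from the failing command above. The guard is an addition so that the script prints the command when gdb, ./kmc, or the input is missing.

```shell
#!/bin/sh
# Arguments copied from the failing command in this thread.
mkdir -p kmc_tmp_dir
args="-v -fm -k31 -ci0 -t1 ecoli1.fasta ecoli1.kmc kmc_tmp_dir"

if command -v gdb >/dev/null 2>&1 && [ -x ./kmc ] && [ -f ecoli1.fasta ]; then
    # Run until the crash, then print a backtrace of the faulting thread.
    gdb -batch -ex run -ex bt --args ./kmc $args
else
    echo "skipping, would run: gdb -batch -ex run -ex bt --args ./kmc $args"
fi
```

The backtrace printed after the segfault is the part worth pasting into the issue.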

jermp commented 2 years ago

My machines have 128GB of RAM :) Also, why not include some examples in the README? I see a lot of people get confused or have no idea how to run this tool. For example, I now have these two files:

ecoli1.kmc.kmc_pre
ecoli1.kmc.kmc_suf

which one should I use?

marekkokot commented 2 years ago

Ok, so this is not out of memory :) Strange :( Thanks for the suggestion. We indeed need to improve the readme. Some examples are given in the command line help. I didn't realize a lot of people got confused. This is bad. I thought the opposite is true.

Regarding the kmc_pre and kmc_suf files: you should use both, because kmc output is split into two files. Alternatively, you could set the output format to KFF, which would be a single file, but probably a larger one.
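As a sketch of what "use both" can look like in practice: downstream tools are typically given the shared name without the .kmc_pre/.kmc_suf suffix. The example assumes the kmc_tools program that ships with KMC 3 and its transform ... dump operation; the output file name ecoli1.kmers.txt is hypothetical, and the guard is an addition for illustration.

```shell
#!/bin/sh
pre="ecoli1.kmc.kmc_pre"
db="${pre%.kmc_pre}"        # strip the suffix: the database name is ecoli1.kmc
out="ecoli1.kmers.txt"      # hypothetical name for the text dump

if command -v kmc_tools >/dev/null 2>&1 && [ -f "$db.kmc_pre" ] && [ -f "$db.kmc_suf" ]; then
    # Both files are located through the shared name; writes k-mers with their counts as text.
    kmc_tools transform "$db" dump "$out"
else
    echo "skipping, would run: kmc_tools transform $db dump $out"
fi
```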

jermp commented 2 years ago

Ok thanks!