refresh-bio / KMC

Fast and frugal disk based k-mer counter
277 stars 72 forks source link

How to get kmer count of kmc output database #200

Closed jessicabonnie closed 1 year ago

jessicabonnie commented 2 years ago

Hi! I cannot seem to figure out how to get the standard output of kmc from a file created using kmc_tools simple union. I need to query the output files from both kmc and kmc_tools to retrieve the counting statistics. I have sort of been able to do this previously by passing the kmc file through to create a dummy kmc thusly. (If there is a recommended method to use instead, please let me know.):

kmc -ci1 -cx15 -k14 fasta1 kmc1 tmp
kmc -ci1 -cx15 -k14 -fkmc kmc1 kmc1x tmp

This doesn't seem to work for a kmc produced through union:

kmc -ci1 -cx15 -k14 fasta2 kmc2 tmp
kmc_tools simple kmc1 kmc2 union kmc3
#The only way I have previously been able to get the reported values out of a previous kmc output is to use the following
kmc -ci1 -cx15 -k14 -fkmc kmc3 kmc3x tmp

This method says that this is impossible because I am using KMC3 rather than KMC2?

marekkokot commented 2 years ago

Hi,

thanks for using KMC. I'm not sure what you really want. Do you want to have a textual representation? I will try to clarify some details first.

First of all the -fkmc is dedicated for counting k-mers in the already existing kmc database. Usually, it means that you want to use some lower value of k. In your example, you are using the same k, which will result in having exactly the same data in both kmc1 and kmc1x. This probably does not make sense.

If you want to output k-mers to the standard output you may use: kmc_tools -hp transform <kmc_database_here_without_extension> dump /dev/stdout

-hp is to hide the percentage progress bar to not mix with k-mers data. If you want to have a dump in the file just replace /dev/stdout with path. Now I think that maybe you just want to have some global statistics, in such a case you may use kmc_tools info <kmc_database_here_without_extension>

Now, about the versioning. There are three major versions of KMC: KMC1, KMC2, KMC3. When KMC2 was introduced it required a huge change in the data format, but it was not the case of KMC3. As a result, we have KMC1 format version for version KMC1 software version. KMC2 format version for both KMC2 and KMC3 software versions. Now, in many cases, for technical reasons, the output of kmc_tools is in KMC1 format even if the input is in KMC2 format. Unfortunately, KMC1 lacks some data making it a little more complicated to use as kmc input, therefore we decided to only accept KMC2 format as input. But if for some reason it will turn out that it is important to add kmc1 format as input for k-mer counting we will reconsider this. Nevertheless, I think you may want something different. Let me know if it helps, and if not could you please be more precise about what you want, maybe with some toy example? Anyway, thanks again for using KMC.

Also, I wonder if you really want to use -cx15 it will exclude all k-mers occurring more than 15 times. It is not usually what the user wants, but maybe in your case it makes sense (hard to guess not knowing your whole pipeline, just wanted to ask because maybe you were unaware of the meaning of this param).

Best Marek

jessicabonnie commented 2 years ago

Thanks so much! I am definitely looking for something like kmc_tools info. I specifically am looking for the number of unique kmers in a previously generated kmc database (as opposed to the "total k-mers" value provided by kmc_tools info by default). Is there a way to extract that particular value?

marekkokot commented 2 years ago

Hi! Total k-mers reported by kmc_tools is the total number of unique k-mers. For example for such a trivial input:

@
GCATCATCGAGCATGCTGATGCTGACTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIII
@
GCATCATCGAGCATGCTGATGCTGACTGA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIII
@
GCATCATCGAGCATGCTGATGCTGACTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIII

running:

kmc -k28  test.fq o .

gives:

**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.599954s
2nd stage: 0.797338s
Total    : 1.39729s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            1
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            3
   No. of unique counted k-mers       :            2
   Total no. of k-mers                :            5
   Total no. of reads                 :            3
   Total no. of super-k-mers          :            3

and running:

kmc_tools info o

gives:

k                 :  28
total k-mers      :  2
cutoff max        :  1000000000
cutoff min        :  2
counter size      :  1 bytes
mode              :  occurrence counters
both strands      :  yes
database format   :  KMC2.x
signature length  :  9
number of bins    :  512
lut_prefix_len    :  4

Oh, unless you mean the total no. of unique k-mers in the input and not in the resulting kmc databaset (which stores unique k-mers after filtering). In such a case there is no way to get this from existing database. It is just not stored there.

So as I understand you have some already generated KMC databases but cannot/don't want to rerun kmc to get the total number of unique k-mers in the input?

jessicabonnie commented 1 year ago

kmc_tools info does not appear to be an option in my build and I'm not seeing it in the docs in the repo. How do I access this command? Do I need to change branches to use this feature?

marekkokot commented 1 year ago

What version are you running? It is possible that kmc_tools info is not documented (it was created for our internal needs, but I will need to document it because it may be useful for others). It is also possible that there is no help message, but just kmc_tools info db should work, where db is a kmc database, i.e. you need to have db.kmc_pre and db.kmc_suf files.

jessicabonnie commented 1 year ago

Is there a way to use one kmc_tools info command to get these statistics for each individual database in a list of them? I have tried to use the @inputfiles approach but it doesn't seem to be cooperating. I so appreciate you, kmc, and this command!

marekkokot commented 1 year ago

You mean you computed a list of kmc databases, for example db1.kmc_pre, db1.kmc_suf db2.kmc_pre, db2.kmc_suf db3.kmc_pre, db3.kmc_suf

and you have a file like this (lets name in input.txt):

db1
db2
db3

And you want to have kmc_tools for each database? Why not just write some simple bash/python/other script to do this? This is slightly modifies script by chatGPT to do this:

#!/bin/bash

# Check if the file path is provided as an argument
if [ $# -eq 0 ]; then
  echo "Please provide the file path as an argument."
  exit 1
fi

file_path=$1

# Check if the file exists
if [ ! -f "$file_path" ]; then
  echo "File not found: $file_path"
  exit 1
fi

# Read each line from the file and pass it as an argument to the application
while IFS= read -r line; do
  # Execute your application command with the line as an argument
  kmc_tools info "$line"
done < "$file_path"

Let me know if that is what you wanted.

jessicabonnie commented 1 year ago

Yes! For reasons I wanted to do it inline rather than calling on a script, but I wrote myself up a for loop so it's all done. Sorry to have bothered you, I just wanted to check if the functionality already existed before I put a hat on a hat! Thanks so much for the great tool!