refresh-bio / KMC

Fast and frugal disk based k-mer counter
252 stars 73 forks source link

How to merge k-mer frequencies #237

Closed GLking123 closed 1 week ago

GLking123 commented 2 weeks ago

Hello, first of all, thank you for giving us KMC and kmc_tools, which I use frequently.

I have both second-generation sequencing data in fq.gz files and third-generation data in .bam files. How should I run the KMC command? Should I run KMC on each type of data separately and then merge them? Previously, I ran jellyfish separately for each type of data and then combined them, but I encountered issues.

Which k-mer size should I choose for a genome size estimated to be around 50G? In my first attempt, I chose K=21, and GenomeScope estimated the genome size as 30G, which is about 20G less than expected. What would be the correct k-mer size to use?

For the above question, could you provide some debugging suggestions? Thank you for your valuable time and assistance. I sincerely look forward to your response!

marekkokot commented 2 weeks ago

Hi,

thanks for using KMC :)

So, you want a single k-mers database for all these files (2nd generation fq.gz + 3rd generation .bam files), right?. KMC expects exactly one file type for a single run, although there may be any number of input files as long as they are the same type (fq and fq.gz are considered the same). So, in this case, you could run kmc twice. First for fq.gz files, then for .bam, and you can combine them with kmc_tools. This has some drawbacks:

  1. It probably wouldn't work because there is some nasty bug, as far as I remember, for .bam files with long reads (3rd generation)
  2. Assuming you want to remove rare k-mers (for example, these occurring only once) for the correct result (meaning the same result you would get when you run kmc for all your input files in a single run (which is not possible due to mixing fq and bams)) you would need to run kmc with -ci1 and then when combining with kmc_tools set ci for output -ci2 (or other value as you need). The drawback is that the kmc files for -ci1 would probably be large. Alternatively, you may accept that the resulting set of k-mers will not be exactly the case as in the case of a single km run for all the files.

To overcome these two limitations, dump your bam to fastq with one of the existing external tools and then run kmc once for all of them in fastq format.

Disclaimer: I don't know if mixing 2nd and 3rd generation is a good idea. It might make sense for your application, or it may not. I am not competent enough to say if this is a good or bad idea.

Similarly, I may not be the best person to give advice regarding the choice of k. The general rule is that the larger the genome, the larger the k should be chosen. Your data has an enormous genome size. I think that for human data, 31 is often used as k, and 21 may be too small. Your data have a genome size order of magnitude larger, so using more than 21 seems reasonable. You may try with 27. I don't know how GenomeScope works. It is also possible that it is not perfect for such large genomes.

Out of curiosity, what is the total size of your input data?

Thanks again for using KMC :)

GLking123 commented 2 weeks ago

Hello,

Thank you for your answer.

With a total data volume of 2T (1.5T for .bam and 500G for .fq), I was considering processing them separately before merging to avoid memory overflow.

My approach involves converting the .bam file into .fastq format and then splitting it into short reads of 150bp. This seemed appropriate, and I successfully managed it by using Jellyfish to process segments and then merge them.

However, the estimated size of the genome turned out to be smaller than expected.

marekkokot commented 2 weeks ago

I see, thanks. Quite large data :) I dont know how much memory you have on your machine, but chances are KMC will be able to process it with less 100G (its just a guess). There is -m parameter to limit the amount of memory (Default 12GB), but KMC will use more if needed. Yet, there is also -sm parameter causing it will not use more at a cost of additional disk operations which may increase running time. Im just saying as an alternative approach.

I'm not sure if it is necessary to split the long reads into short of 150bp, KMC should handle mixed reads lengths, even mixing 2nd and 3rd generation as long as they are in the same format. There are just issues if the long reads are in bam format. Splitting long reads intuitively to me sounds like losing some info. On the other hand counting k-mers is also splitting to short sequences, so maybe it does not matter. Just sharing my thoughts.

I'm not sure if I answered all your questions, so if I missed something let me know. Thanks for using KMC.

GLking123 commented 1 week ago

Hello, I've been trying this for the past few days. I'll share the results with you as soon as I have them.

marekkokot commented 1 week ago

Hi, I'm not sure if I am the best one to give adivices regarding k choice, 31 sounds reasonable, but maybe even more would work better. Maybe creators of GenomeScope would be able to advice better on this matters. I'm not sure what you mean by debugging suggestions here

GLking123 commented 1 week ago

Thank you for your reply, I now know how to do it.

GLking123 commented 1 week ago

Hi,

When I run kmc_tools, some errors appear in the generated files.The command is as follows: kmc_tools transform $name histogram "new_kmer31_"$name".histo" -ci2

The input file is as follows: image

The result is as follows: image

I don't know why the generated .histo file is truncated at line 253, and all the remaining data is accumulated at line 253.

Thank you for your valuable time and assistance. I sincerely look forward to your response!

marekkokot commented 1 week ago

Hi, try this:

kmc_tools transform $name histogram "new_kmer31_"$name".histo" -ci2 -cx1000000

all above cx value are collapsed into this value, the default is 255. Also if you generated your kmc database with default cs, all counters above 255 are reduced to 255, so you may need to rerun with higher cs if you need (that is purely motivated by the storage requirements, the assumption was that it does not really matter in many applications if k-mer occurred 255 times or more, and this way we may just use 1 byte for counter, but user is free to increase cs if needed).

GLking123 commented 1 week ago

Hi,

Thank you for your advice, I must reconfigure using a higher CS, otherwise it won't work.