refresh-bio / KMC

Fast and frugal disk based k-mer counter
253 stars 73 forks source link

GetCountersForRead requires random access mode #159

Open tbenavi1 opened 3 years ago

tbenavi1 commented 3 years ago

Hello,

I have a dataset where the KMC database is 1.4 Terabytes. I have been using the GetCountersForRead function, but as currently implemented this requires the database be loaded in random access mode and loaded into RAM. Do you have any ideas for how to get the kmer counts across the read for such a large dataset that doesn't require loading the entire database into memory? Thanks for any insights.

marekkokot commented 3 years ago

Hello,

the answer is not easy, but I will try. There is a hidden option in kmc_tools that just checks a single k-mer - it is extremely slow, but it does not require loading the whole database into RAM. If you are interested let me know and I will show you how to use it. I do not recommend use this option, but it may be some fast workaround.

Could you describe your pipeline, especially for which reads (and how many) do you call GetCountersForRead? Is it for the whole input you used to create KMC database?

tbenavi1 commented 3 years ago

Thanks for the quick response. I am interested in the hidden option for kmc_tools. For our pipeline, we would like to be able to call GetCountersForRead on all the reads used to create the KMC database. Depending on the speed of the workaround, we may have to only run on a subset of the reads. Thanks.

marekkokot commented 3 years ago

Here is an example that I hope is self-explanatory

echo "@\n" > test.fastq
echo "ACGACGTACTGACGAGCTGACGTACGGATCGTACGTACGTAGTCGTAC\n" >> test.fastq
echo "+\n" >> test.fastq
echo "IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n" >> test.fastq
./kmc -ci1 -k17 test.fastq kmcDB .
./kmc_tools check kmcDB AAAAAAAAAAAAAAAAA
./kmc_tools check kmcDB ACGACGTACTGACGAGC

The output is:

**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.829439s
2nd stage: 0.095974s
Total    : 0.925413s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :           32
   No. of unique counted k-mers       :           32
   Total no. of k-mers                :           32
   Total no. of reads                 :            1
   Total no. of super-k-mers          :            8
0
1

In general, the count of checked k-mer is printed. I am almost sure it will be extremely slow. Developing a fast algorithm for accessing the database without reading it into RAM would require some effort and I am not sure when I will be able to find a time. Unfortunately, there is a lot of interesting stuff to do, but very often we need to choose those that will turn into publication.

Some speedup could be probably achieved by implementing this functionality into KMC API, but I do not want it in the simplest form in the official API cause it will be slow. If you have time and resources you may try to implement it yourself basing on the way it works in kmc tools. In general it is implemented in: https://github.com/refresh-bio/KMC/blob/master/kmc_tools/check_kmer.h Unfortunately, there is some dependency of this class and other parts of kmc tools code, but maybe the dependency is not that big.

There is also another option that could be useful for you. Maybe in your pipeline, it would be sufficient to only count k-mers having a number of occurrences above some threshold (-ci parameter of kmc). This way KMC database may be much smaller and maybe you would be able to read it into memory. In the default mode, KMC refuses k-mers occurring only once in the input, but maybe in your case higher value is acceptable.

Yet another option is to do as follows. Let assume your input is A.fastq. Let A_1.fastq, A_2.fastq, A_n.fastq be consecutive pieces of the A.fastq (i.e. concatenation of A_1.fastq, A_2.fastq, ..., A_n.fastq is A.fastq).

  1. Count k-mers in A.fastq, let the result be A-kmers
  2. Sort A-kmers using kmc_tools to optimize further operations.
  3. For each A_i.fastq: 3.1 Cont k-mers in A, let the result be A_i-kmers. 3.2 Create intersection of A-kmers and A_i-kmers, in the intersection operation of kmc_tools set counter calculation mode to values of A_i-kmers database. Let take result be A_i-kmers.intersect. This way A_i-kmers.intersect contains all k-mers of A_i.fastq but with counters taken from the whole database (A-kmers). Resulting database A_i-kmers.interect should be small enough to load into memory and use GetCountersForRead directly.

Of course, n should be adjusted reasonably, in general, the greater n the better, but too large n wold lead to databases that cannot fit into memory.

It is hard to predict which option would work best. None of the presented satisfies me :( but I am afraid that for now, I cannot propose anything else.

Out of curiosity what is your input that makes the database so huge?

tbenavi1 commented 3 years ago

Thank you for your detailed response! The last option you explain is a good course of action for our pipeline. I have one final question. You mentioned sorting the A-kmers to optimize further operations. Would sorting the A_i-kmers also optimize the intersect operation? Thanks for your help.

The data are high coverage from a tree genome with an estimated genome size of 14 Gbp. The data are also ONT data so the higher error rate would increase the number of kmers in the database.

marekkokot commented 3 years ago

It is hard to predict if it is better to sort each A_i-kmers before the intersection, but intuitively I think it will work better (faster) if not it will not be sorted. The rationale is that kmc_tools performs set operation on sorted k-mers. If any input database is not sorted it will be during set operation. As A-kmers is an input many times (n times) it would be sorted n times. This is why I think it is wort to sort it in advance. On the other hand, each A_i-kmers will be read only once, so I doubt there will be any gain if you sort it before the set operation. In fact, it may be slower due to increased IO.

Technical detail: k-mers in database produced by KMC are not sorted, but they are grouped into bins. In each bin k-mers are sorted. The number of bins is 512 (default), so the sorting may be performed more efficiently than in the case of a totally random order of k-mres. This is also why it does not require reading whole database into memory. To keep memory requirements low the internal buffers are relatively small (2 x 2MiB per bin). If you store your database on HDD drive maybe using larger buffers would be better, but unfortunately, the size of these buffers is not configurable via command line parameters. Also, it is hard to predict the impact of these buffers' size for performance, but if you notice that sorting A -kmers is very slow it may be the cause.

Other remark: You could also create each A_i-kmers KMC databases and after that perform an intersection, but probably it will be better if you use A_i-kmers as soon as it is created, because this way you may avoid actual reading A_i-kmers database form disk, because of operating system caching.

When you check this approach let me know if it works satisfactorily.