splatlab / squeakr

Squeakr: An Exact and Approximate k -mer Counting System
BSD 3-Clause "New" or "Revised" License
85 stars 23 forks source link

Segmentation fault + question #17

Closed GuillaumeHolley closed 6 years ago

GuillaumeHolley commented 6 years ago

Hi there :)

So, I just downloaded the code of Squeakr, compiled it and run squeakr-count on the test file: everything went fine.

Then, I tried squeakr-count on a different file: https://www.ncbi.nlm.nih.gov/sra/?term=ERR430991. The Squeakr paper says "Squeakr takes the approximation of number of distinct k-mers as the number of slots to create the CQF". The number of distinct 31-mers in my file in roughly 30 million. Hence, I configured Squeakr with a CQF size parameter of 25 (roughly the log2 of 30 million).

So, I used the following command:

./squeakr-count 0 25 1 ERR430991.fastq

And the result is:

Reading from the fastq file and inserting in the QF Segmentation fault (core dumped)

I tried again with 30 instead of 25 as CQF size parameter and this time, I obtained:

Reading from the fastq file and inserting in the QF Total Time Elapsed: 51.129627seconds Calc freq distribution: Total Time Elapsed: 0.751258seconds Maximum freq: 897 Num distinct elem: 28471250 Total num elems: 343450408

So, my first question is: am I doing something wrong when I pick the CQF size parameter? Should I overestimate the approximation of the number of distinct 31-mers in my file?

Also, I used KMC3 to compute k-mer counts from the same file and KMC3 give me in output 30311678 distinct 31-mers while Squeakr says there are 28471250 distinct 31-mers. Any ideas on the difference of about 1.5 million distinct 31-mers?

Thank you for your time and help!

Guillaume

prashantpandey commented 6 years ago

Hi Guillaume,

There are two separate questions. Let me answer then one by one.

  1. The log2 of the number of slots passed as a parameter to squeakr-count is an estimation of the total number of slots needed to store all k-mers and their counts. So, given that the fastq file has 30M distinct k-mers, creating a CQF of 32M slots would be tight because counts of the k-mers would also occupy some slots. We usually first use ntCard to estimate the cardinality of the fastq file(s) for a given k-mer size and then use the output to approximate the size of the CQF. The way we approximate is this:

We use ntCard to estimate the number of distinct k-mers F0 and the number of k-mers of count 1 and 2 (f1 and f2, respectively) in each experiment. We then estimated the number of slots needed in the counting quotient filter as s = f1 + 2f2 + 3(F0 − f1 − f2). The number of slots in the counting quotient filter must be a power of 2, so let s' be the smallest power of 2 larger than or equal to s. In order to be robust to errors in our estimate, if s was more than 0.8s', we construct the counting quotient filter with 2s' slots. Otherwise, we construct the counting quotient filter with s' slots.


  1. The number of distinct k-mers reported by Squeakr is 28M and reported by KMC3 is 30M. There are two reasons for this. First, Squeakr is an approximate k-mer counter. So, because of collisions in k-mer hashes, the number of distinct k-mers reported by Squeakr would be less compared to an exact k-mer counter. However, the difference is always very small and is bounded by the error in Squeakr which is also very very small (always smaller than 0.003). Second, here you are seeing such a big difference in the number of distinct k-mer because, in Squeakr, the default k-mer size is 28. So, the numbers reported by Squeakr are for 28-mers. I am writing a patch for users to specify the k-mer size as an input parameter.

For the sanity check, I changed the code to count 31-mers and here is the output from Squeakr:

./squeakr-count 1 30 1 ../kmercounting/data/ERR430991_1.fastq.gz ../kmercounting/data/ERR430991_2.fastq.gz 
Reading from the fastq file and inserting in the QF
Total Time Elapsed: 62.578545seconds
Calc freq distribution: 
Total Time Elapsed: 0.873581seconds
Maximum freq: 866
Num distinct elem: 30309948
Total num elems: 329329415

Thanks, Prashant

prashantpandey commented 6 years ago

I just pushed a patch and updated the README to specify the k-mer size as an input parameter.

GuillaumeHolley commented 6 years ago

Hi Prashant :)

Thank you for your detailed answer!

Regarding my first question, thanks for the clarification. It might be worth updating your README file with this description on how to configure the CQF size parameter of Squeaker, such that other users can estimate this parameter as well ;)

Regarding my second question, I don't know why I thought the default k was 31 while it is 28. Thanks for adding k as a parameter of your software.

Is there anyway to dump the k-mers and their approximate counts on disk from a Squeakr representation? Also, do you plan to release the exact counting version of squeakr-count?

Guillaume

prashantpandey commented 6 years ago

Hi Guillaume,

Thanks for the suggestion. I will add a script that uses the output from ntCard and returns the size of the CQF needed by Squeakr.

In the approximate version of Squeakr, we can't recreate k-mers from hashes. (We use a one-way Murmur Hash.) Therefore, we can't list out k-mers and their approximate counts. However, you can build a frequency distribution histogram by listing out all the counts.

And yes, I just released Squeakr exact. However, Squeakr-exact currently doesn't support k-mers of size greater than 32.

Thanks, Prashant