splatlab / squeakr

Squeakr: An Exact and Approximate k-mer Counting System
BSD 3-Clause "New" or "Revised" License

Segmentation fault (core dumped) #24

Open nordhuang opened 6 years ago

nordhuang commented 6 years ago

./squeakr-count -f -k 28 -s 20 -t 1 -o ./ S008_20180206001-8_ffpedna_pan-cancer-v1_5717_S8_R2_001.fq
Reading from the fastq file and inserting in the QF
Segmentation fault (core dumped)

head -8 S008_20180206001-8_ffpedna_pan-cancer-v1_5717_S8_R1_001.fq
@NB551106:74:HG7CWBGX5:2:11106:12634:1554 1:N:0:AGTTCC
ACTCTGGCCTGGGTGACAGAGTGAGACTCGGGCTAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAATAA
+
AAAAAAEEE////A/EAE/E///E6EA/A///<<EE//EEEEEEEEEEEEEEEEE6EEEEEEEEE6EEEE//AE/E<///<<EAAEA/EAAEE6/EEEEEEEEAA<AEEE//E///<<E<<///</E/E///A///A<<//////////
@NB551106:74:HG7CWBGX5:4:22601:22501:19465 1:N:0:AGTTCC
ACTCTGGCCTGGGTGACAGAGTGAGACTCGGGCTAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
AAAAA/AAE///AA/EAEAE///EAAAA<//<A/EE6EEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEE//<EAA//////<<AA///A//E//E/EEEEE/E<//E///A///////6////</A////////A///6///6///////

I have provided the fastq file S008_20180206001-8_ffpedna_pan-cancer-v1_5717_S8_R1_001.fq. Why does it throw a segmentation fault?

compbio commented 6 years ago

I got the same problem. Any solution?

test_1.txt test_2.txt

rtjohnso commented 6 years ago

Can you send me the fastq file that caused the segfault? You can either attach it to your issue report, or send it via email. If it's large, you might try deleting lines to cut it down to a "minimal working example", i.e. a shorter file that still causes squeakr to crash.

Best, Rob

compbio commented 6 years ago

I uploaded the first 12 lines of the fastq files.

chelseaju commented 6 years ago

I am experiencing the same issue. It gives Segmentation fault (core dumped) when processing a fastq file with around 10 million reads. When I tried to run a smaller file (including the provided sample file, test.fastq), it says:

Error opening file for serializing: No such file or directory

Any ideas?

prashantpandey commented 6 years ago

Hi @chelseaju, which command are you running to count k-mers from the sample fastq file (test.fastq)? I am not able to reproduce the issue. I am using this command: ./squeakr-count -f -k 28 -s 20 -t 1 -o ./ test.fastq

Thanks, Prashant

prashantpandey commented 6 years ago

Hi @nordhuang, I tried running squeakr-count using the fastq file you provided, but I am not getting any segmentation fault. I am using this command: ./squeakr-count -f -k 28 -s 20 -t 1 -o ./ tmp.fastq

Could you please confirm that you are using the same command?

Thanks, Prashant

chelseaju commented 6 years ago

Hi @prashantpandey, thanks for the quick response. I ran the command you suggested and it resolved the "No such file or directory" issue. Apparently, that error message arises when the output directory does not exist. However, the same command still produces "Segmentation Fault" when processing a large number of reads (in my case, more than 387412 lines).
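
For reference, that error message is exactly what a failed fopen on a path inside a missing directory looks like. A minimal C sketch (illustrative only, not Squeakr's actual serialization code):

#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void) {
    /* Opening a file inside a directory that does not exist fails with
     * errno == ENOENT, which strerror reports as "No such file or directory". */
    FILE *f = fopen("./no_such_dir/out.ser", "wb");
    if (!f) {
        fprintf(stderr, "Error opening file for serializing: %s\n",
                strerror(errno));
        return 1;
    }
    fclose(f);
    return 0;
}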

prashantpandey commented 6 years ago

Hi @chelseaju , is there a way I can access your fastq file to reproduce the issue?

Thanks, Prashant

chelseaju commented 6 years ago

I am attaching the smaller fastq file (with 387416 lines). I also tried line-by-line debugging. It seems that the error occurs in the qf_serialize() function in threadsafe-gqf/gqf.c (line 2139). Unfortunately, I don't really know how to fix this.

test.fq.gz

prashantpandey commented 6 years ago

Hi @chelseaju, just wanted to check: are you seeing the segfault issue with the (smaller fastq) file you uploaded? Also, could you specify the exact command you are using?

Thanks, Prashant

chelseaju commented 6 years ago

@prashantpandey I used the command you suggested: ./squeakr-count -f -k 28 -s 20 -t 1 -o . test.fq

prashantpandey commented 6 years ago

Hi @chelseaju,

I tried reproducing your bug, but I am actually able to count k-mers in the test.fq file that you provided without any error.

./squeakr-count -f -k 28 -s 20 -t 1 -o . test.fq
Reading from the fastq file and inserting in the QF
Total Time Elapsed: 3.161622seconds
Calc freq distribution: 
Total Time Elapsed: 0.020426seconds
Maximum freq: 129
Num distinct elem: 732988
Total num elems: 4643908

accopeland commented 6 years ago

Hi, I'm also seeing segfaults. I'm attaching the smallest file that produces an error on my machine (4316 pairs). The machine is:

Linux dint01 2.6.32-696.18.7.el6.nersc.x86_64 #1 SMP
 product: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
       vendor: Intel Corp.
       physical id: 1
       bus info: cpu@0
       size: 2601MHz
       capacity: 2601MHz
       width: 64 bits
       capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp x86-64 constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid cpufreq

Compiled with NH=1 (otherwise I see an illegal instruction error).

The following command produces a segfault (for any 13 <= k <= 29) with uncompressed or gzipped fastq. Strangely, k=11 works, as do k=29 and various values up to 61, but I did not test exhaustively. The thread count doesn't seem to matter.

squeakr-count -f -k 13 -s 20 -t 22 -o ./ x.fq
Reading from the fastq file and inserting in the QF
Segmentation fault

x.fq.gz

Christina-hshi commented 6 years ago

Found a bug that causes a segmentation fault in the program.

Hi all,

When running the program with the parameters given in the README example (./squeakr-count -f -k 28 -s 20 -t 1 -o .), I also got the segmentation fault. I used GDB to see where things went wrong inside the code, and found that the code near line 1360 in gqf.c is not safe:

1360 uint64_t empty_slot_index = find_first_empty_slot(qf, runend_index+1);
1362 shift_remainders(qf, insert_index, empty_slot_index);

Line 1360 tries to find the first empty slot at or after slot (runend_index+1). However, there may be no empty slot after it, in which case the returned slot index is larger than the total number of slots and is therefore not a valid index. As a result, line 1362 triggers a segmentation fault by accessing memory the filter does not own. So we need to check here whether "empty_slot_index" is out of bounds.
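
A minimal sketch of that guard (a fragment, not a complete program: QF, find_first_empty_slot, and shift_remainders come from gqf.c, and the field name qf->metadata->nslots is an illustrative assumption):

uint64_t empty_slot_index = find_first_empty_slot(qf, runend_index + 1);
if (empty_slot_index >= qf->metadata->nslots) {
    /* No empty slot exists at or after runend_index+1: the filter is full.
     * Fail loudly instead of shifting remainders into unowned memory. */
    fprintf(stderr, "CQF out of slots; rerun with a larger -s value.\n");
    abort();
}
shift_remainders(qf, insert_index, empty_slot_index);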

More importantly, I think the real cause of the problem is unreasonable parameters. For example, if we set "-k 28 -s 20", then the maximum number of distinct k-mers is 2^(2*28) = 2^56, while the number of slots in the RSQF is ~2^20. If we assume every object stored in the RSQF has frequency at least 3, and the hash_bits is 20 (it should be >= 20 based on the '-s' parameter), then under the RSQF's encoding scheme each object uses at least 3 slots. So the RSQF can actually store at most (2^20)/3 objects. Since 2^56 is far larger than 2^20, it is entirely possible that hashing produces 2^20 distinct objects, which means the RSQF runs out of empty slots after inserting (2^20)/3 unique objects; the "segmentation fault" follows. In fact, the program sets the hash bits to s+8, so the chance of a "segmentation fault" (running out of owned space) is even higher, since we may need to store up to 2^28 unique objects in memory blocks that can hold at most (2^20)/3 unique objects with frequency >= 3.

How do we find reasonable parameters? The key step is to estimate the number of distinct objects that are going to be inserted into the RSQF.
In theory, the maximum number of distinct objects is min(4^k, 2^hash_bits) if we don't consider the amount of data we have and its specific properties. If 2^hash_bits <= 4^k, then since the RSQF can store at most (2^hash_bits)/3 objects under our assumption, it is very likely to run out of space and hit errors. The solution therefore seems to be to set k such that 4^k <= (2^hash_bits)/3. For example, if hash_bits is 20, then k should be <= 9. If we want to use a large k, we need to increase hash_bits accordingly: if we want k=28, then hash_bits should be >= 58. Even if each slot used only 1 byte, that would require 2^58 bytes, which is memory-prohibitive. Fortunately, in many real cases the number of distinct objects is much smaller than the theoretical maximum. For example, suppose we want to build the k-mer spectrum of the human genome using an RSQF. Since the human genome is around 3 billion bp long, there will be at most ~2*3 = 6 billion unique k-mers (also counting the reverse-complement strand), assuming no or extremely low sequencing error in our data. So no matter how large k is, the number of unique k-mers is bounded by ~6 billion, and we can set hash_bits >= 35 to have a low chance of running out of slots.
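
A tiny C sketch of this back-of-the-envelope rule (the function name and the 6-billion estimate are illustrative assumptions, not Squeakr code): pick hash_bits so that (2^hash_bits)/3 exceeds the expected number of distinct k-mers.

#include <stdint.h>
#include <stdio.h>

/* Smallest hash_bits such that 2^hash_bits / 3 >= distinct_kmers,
 * i.e. enough slots when each object may use at least 3 of them. */
static unsigned min_hash_bits(uint64_t distinct_kmers) {
    unsigned bits = 0;
    while (bits < 64 && ((uint64_t)1 << bits) / 3 < distinct_kmers)
        bits++;
    return bits;
}

int main(void) {
    /* ~6 billion distinct k-mers for a human genome, per the estimate above */
    printf("hash_bits >= %u\n", min_hash_bits(6000000000ULL)); /* prints 35 */
    return 0;
}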

prashantpandey commented 6 years ago

Hi @Christina-hshi, thanks for looking into the segfault. You are right that the segfault happens because the number of slots in the CQF (counting quotient filter) is not enough. The example command ./squeakr-count -f -k 28 -s 20 -t 1 -o . is meant to count k-mers in the test.fastq file, which contains fewer than 2^20 28-mers.

However, for other fastq files, we have a script, lognumslots.sh, to decide the correct size of the CQF (the -s argument); it is also mentioned in the README. The script takes as input the path to the output file of ntCard (https://github.com/bcgsc/ntCard) and calculates the log of the number of slots Squeakr needs to count the k-mers. Please try this script.

Thanks, Prashant

prashantpandey commented 5 years ago

Hi @accopeland and @chelseaju, could you please try the latest release on the master branch? We have made some changes to the API and added auto-resizing. Please read the new README for the new CLI.

Please let me know if you still see the bug.

Thanks, Prashant

Tgrandis commented 5 years ago

Hello: I also get similar "illegal instruction" problems:

squeakr count -k 33 -t 1 -o Xxyl04.squeakr Xxyl04_R1.fastq Xxyl04_R2.fastq
[2019-03-22 14:27:51.191] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Illegal instruction (core dumped)

Initially I tried with multiple threads, but then I need to specify the -s parameter. To do that, I installed the ntCard program, got a histogram output from it, and tried the lognumslots script.

./scripts/lognumslots.sh Xxyl04_ntcard_k33.hist
./scripts/lognumslots.sh: line 9: 1166120114 - - : syntax error: operand expected (error token is "- ")
./scripts/lognumslots.sh: line 10: + 2 + 3 : syntax error: operand expected (error token is "* ")
(standard_in) 1: syntax error

So I cannot get that to work either. I am using an Ubuntu machine (Ubuntu 18.04.2 LTS), and I installed squeakr v1.0 only recently. All I want to do is find a way to quickly (but repeatedly) check the frequency of particular k-mers, and to separate my k-mers and corresponding sequences into error/low-copy/repeat groups.

Simply setting a value for -s did not work either:

./squeakr count -e -k 25 -s 20 -t 6 -o Xxyl04.squeakr Xxyl04_seq1.fastq Xxyl04_seq2.fastq
[2019-03-22 14:20:37.588] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Illegal instruction (core dumped)

I tried with a very small dataset (about 1,000 paired reads as fastq), a pair of 3 GB gzipped datasets, and a pair of 8 GB gzipped datasets (the last one is the one that really needs to be analysed).

chelseaju commented 5 years ago

Hi @prashantpandey, I am still struggling with the number of slots in the CQF argument, which leads to a segfault for one of my files. I also tried running lognumslots.sh and got the same error @Tgrandis observed.

I first ran ntCard, which generated an output with three columns, "k", "f", and "n":

k  f  n
15 1 97756207
15 2 46525201
15 3 22294693
15 4 10887250

It also outputs this information to the screen:

k=15 F1 5917728218
k=15 F0 213786893

Looking at the script lognumslots.sh, I could not find any line starting with "F0", "f1", or "f2" in my ntCard output, so lines 6-8 of the script fail. Given the information from ntCard, what formula can we use to estimate the number of slots?

Thanks, Chelsea

prashantpandey commented 5 years ago

Hi @chelseaju

The output format has changed in the new version of ntCard. I will update the script according to the new format ASAP.

Thanks, Prashant

prashantpandey commented 5 years ago

Hi @chelseaju ,

We are working on updating lognumslots.sh to work with the new ntCard format. In the meantime, you can use the previous ntCard release, v1.0.1 (commit fb05b32).

https://github.com/bcgsc/ntCard/releases/tag/1.0.1

This would get you unstuck.

Thanks, Prashant

prashantpandey commented 5 years ago

Hi @chelseaju ,

I have pushed a fix. But please make sure not to specify the output file option -o in the ntCard command; the script expects the F0 value to be in the output file. For example:

./ntcard -k <kmer-length> -p <prefix> <file>

Thanks, Prashant

chelseaju commented 5 years ago

Hi @prashantpandey, thanks for fixing this issue. lognumslots.sh works well with the output from ntCard. However, even with the suggested number of slots as the input parameter, I am still getting the segmentation fault error. The dataset I am testing contains around 87 million reads with a read length of 180bp. When counting 15-mers, it generates the segmentation fault message; however, it seems fine when counting 16-mers and 17-mers. Any idea about this?

Thanks, Chelsea

prashantpandey commented 5 years ago

Hi @chelseaju,

Yeah, I guess I understand what's going on. How many times does it resize before crashing? Also, how many 16-mers/17-mers are there?

Here's what might be going on: with 15-mers we get 30-bit hashes of k-mers in Squeakr-exact. To insert the hashes into the quotient filter we split each hash into quotient and remainder bits. By default, the remainder is 8 bits, which makes the quotient 22 bits. We create the quotient filter with 2^22 slots, where each slot is 8 bits. Every time it resizes, it borrows a bit from the remainder and increases the quotient, in order to increase the number of slots in the structure.

With a small k-mer size (and hence a smaller hash value) it can't resize enough times to insert all the k-mers.
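
For intuition, a small standalone C sketch of that arithmetic (illustrative only; variable names are not Squeakr's):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned k = 15;
    unsigned hash_bits = 2 * k;          /* 30-bit k-mer hashes in exact mode */
    unsigned remainder_bits = 8;         /* default remainder width */
    unsigned quotient_bits = hash_bits - remainder_bits;  /* 22 -> 2^22 slots */

    /* Each resize moves one bit from the remainder to the quotient, doubling
     * the slot count; once the remainder hits 0 bits, no resize is possible. */
    for (; remainder_bits > 0; quotient_bits++, remainder_bits--)
        printf("slots = 2^%u, remainder = %u bits\n",
               quotient_bits, remainder_bits);
    return 0;
}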

Thanks, Prashant

chelseaju commented 5 years ago

Hi @prashantpandey,

Thanks for the quick response. For 15-mers, if I set the slots to 29 (as recommended), it crashes relatively soon (before the first resizing). If I set the slots to 28, it crashes after the first resizing.

For 16-mers/17-mers, I set the slots to 29 as well, and it resizes once.

In a case like this, do you recommend running the approximate count instead of Squeakr-exact?

Thanks, Chelsea

prashantpandey commented 5 years ago

Hi @chelseaju ,

For 15-mers, if slots=29 and it resizes, that means the quotient filter has no bits left for the remainder and therefore no space to keep counts of k-mers. It also means there are enough k-mers that it would be better to use a counting table instead of the quotient-filter-based hash table used in Squeakr-exact.

We are working on adding a workaround in Squeakr to handle this case, where k is small and the dataset contains almost all 4^k k-mers. However, it might take a few days to get this out.

In the meantime, you can use Squeakr-approximate and try slots=24 (in approximate mode it uses 8-bit remainders, so the total hash size would be 24+8 = 32 bits) and see if it's able to resize and complete.

However, since the total number of hash values (2^32) is much larger than the total number of k-mers, there would be very few collisions and you would get (almost) exact counts.

Thanks, Prashant

kamimrcht commented 5 years ago

Hi @prashantpandey, I noticed a segfault when using squeakr on large read files with multiple threads. It stops very early:

[2019-07-11 09:28:25.135] [squeakr_console] [info] Reading from the fastq file and inserting in the CQF.
Segmentation fault (core dumped)

I tried the latest version and the commit of November 18. I compiled with NH=1. I tried with k=21 and 31, and s=29 and 33. Every time, with more than 10 million reads, I end up with the same error; when I use a sample of fewer than 10 million reads, squeakr works fine. This is my command line:

./squeakr/squeakr count -e -k 31 -t 20 -s 29 -o results/tmp.squeakr ERR164480_1.fastq

I then tried using a single thread, and I could run squeakr on the whole file.

The datasets can be found here: https://www.ncbi.nlm.nih.gov/sra/ERX140357 (I also tried both fastq files).

Thank you!

prashantpandey commented 5 years ago

Hi @kamimrcht , I will try and reproduce this locally and will get back to you.

Thanks, Prashant