uni-halle / gerbil

A fast and memory-efficient k-mer counter with GPU-support
MIT License

gerbil "make" is not working #5

Open saranpons3 opened 7 years ago

saranpons3 commented 7 years ago

Hello developers of Gerbil, I'm trying to install Gerbil on my desktop computer, which runs Ubuntu 16.04 and has an i7 processor and a GeForce GTX TITAN Black/PCIe/SSE2 GPU. However, the "make" command hangs forever with the following console messages:

    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h: In instantiation of ‘void gerbil::KMer<K, B, 8u, C>::setInv(const byte const&) [with unsigned int K = 480u; unsigned int B = 121u; unsigned int C = 16u; gerbil::byte = unsigned char]’:
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:167:4: required from ‘static void gerbil::KMer<K, B, 8u, C>::set(const byte const&, gerbil::KMer&, gerbil::KMer&) [with unsigned int K = 480u; unsigned int B = 121u; unsigned int C = 16u; gerbil::byte = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:355:21: required from ‘void gerbil::KmerHasher::processThreadSplit(const uint8_t&, gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 480u; bool NORM = true; uint8_t = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:107:38: required from ‘gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >)::<lambda(const uint8_t&, gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >)> [with unsigned int K = 480u; uint8_t = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:102:10: required from ‘struct gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 480u]::<lambda(const uint8_t&, class gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle<480u> >, class gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle<480u> >)>’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:99:34: required from ‘void gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 480u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:162:22: required from ‘gerbil::KmerHasher::process_template()::<lambda()> [with unsigned int K = 480u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:120:23: required from ‘struct gerbil::KmerHasher::process_template() [with unsigned int K = 480u]::<lambda()>’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:119:19: required from ‘void gerbil::KmerHasher::process_template() [with unsigned int K = 480u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:535:6: required from here
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:149:23: warning: left shift count >= width of type [-Wshift-count-overflow]
     data[c]<<=_c_offset;
     ^
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:152:13: warning: left shift count >= width of type [-Wshift-count-overflow]
     data[0] <<= _c_offset;
     ^
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h: In instantiation of ‘void gerbil::KMer<K, B, 8u, C>::setInv(const byte const&) [with unsigned int K = 512u; unsigned int B = 129u; unsigned int C = 17u; gerbil::byte = unsigned char]’:
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:167:4: required from ‘static void gerbil::KMer<K, B, 8u, C>::set(const byte const&, gerbil::KMer&, gerbil::KMer&) [with unsigned int K = 512u; unsigned int B = 129u; unsigned int C = 17u; gerbil::byte = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:355:21: required from ‘void gerbil::KmerHasher::processThreadSplit(const uint8_t&, gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 512u; bool NORM = true; uint8_t = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:107:38: required from ‘gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >)::<lambda(const uint8_t&, gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >)> [with unsigned int K = 512u; uint8_t = unsigned char]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:102:10: required from ‘struct gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 512u]::<lambda(const uint8_t&, class gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle<512u> >, class gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle<512u> >)>’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:99:34: required from ‘void gerbil::KmerHasher::processSplit(gerbil::SyncSwapQueueMPSC<gerbil::cpu::KMerBundle >, gerbil::SyncSwapQueueMPSC<gerbil::gpu::KMerBundle >) [with unsigned int K = 512u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:162:22: required from ‘gerbil::KmerHasher::process_template()::<lambda()> [with unsigned int K = 512u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:120:23: required from ‘struct gerbil::KmerHasher::process_template() [with unsigned int K = 512u]::<lambda()>’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:119:19: required from ‘void gerbil::KmerHasher::process_template() [with unsigned int K = 512u]’
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KmerHasher.h:535:6: required from here
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:149:23: warning: left shift count >= width of type [-Wshift-count-overflow]
     data[c]<<=_c_offset;
     ^
    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:152:13: warning: left shift count >= width of type [-Wshift-count-overflow]
     data[0] <<= _c_offset;

srechner commented 7 years ago

Compiling Gerbil from scratch may take a long time (up to half an hour). Please be patient. ;-)

saranpons3 commented 7 years ago

Sir, thanks for your reply. I followed all the steps at https://github.com/uni-halle/gerbil to install Gerbil, but the build is taking more than half an hour. It hangs for a long time at the following line:

    /home/saravanan/gerbil/src/gerbil/../../include/gerbil/KMer.h:152:13: warning: left shift count >= width of type [-Wshift-count-overflow]
     data[0] <<= _c_offset;

srechner commented 7 years ago

That's strange. Could you please create a fresh build2 directory, run the commands

    cd build2
    cmake -DCMAKE_BUILD_TYPE=Release ..

and paste the output? Please also paste the output of the command free.

saranpons3 commented 7 years ago

Hello Sir, "make" finished successfully and the gerbil executable was created inside the build folder. I don't know how long "make" took, though: I started it on Saturday evening, left the workplace, and checked again on Monday morning.

Now I'm trying to run Gerbil on my data set, which is around 10.2 GB. This is a different data set than the one given in your published paper. When I try to run Gerbil, I get the following message:

**"size e of memory (0)  is too small (should be: e >= 512)"**

My execution command is as follows:

    ./gerbil -k 28  /SRR_10_2_GB/  temp  output

My hardware configuration is as follows:

    cpu: Intel® Core™ i7-4770 CPU @ 3.40GHz × 8
    gpu: GeForce GTX TITAN Black/PCIe/SSE2 GPU
    RAM: 8 GB
    Harddisk: 1 TB

My guess is that this error is caused by insufficient RAM. I think so because the machines used in your published paper had 16 GB and 32 GB of RAM.

Could you tell me why I'm getting this error message on my system, and whether my guess is correct?

Thanks in advance.

merbert commented 7 years ago

Hello Saranpons3, it looks as if the size of your memory was not recognized correctly. You can also set the memory manually, e.g. with -e 6GB. The minimum memory requirement is 512 MB.

saranpons3 commented 7 years ago

Hello merbert, thanks for your reply. Now it is working fine, but I'm finding it difficult to interpret the output. I have read the example given in the "readme.md" file, but I still can't understand the output. I have pasted two lines from my output file below. Could you help me understand them? I used k=28.

    000d511a: 10011111 11000110 10110111 11110011 10000010 01100100  .....d
    000d5120: 00001101 00001101 00101000 00100001 01100101 01011010  ..(!eZ

merbert commented 7 years ago

First byte: 10011111 --> 128+16+8+4+2+1 = 159. Since 159 != 255, 159 is the counter.

k / 4 = 28 / 4 = 7 --> the next 7 bytes represent the k-mer. The next byte is 11 00 01 10 --> T A C G, and so on. After all 7 bytes have been read, start again with the counter byte. Be careful: the counter can also be 5 bytes long.
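The decoding steps above can be sketched in C++ roughly as follows. This is only an illustration, not Gerbil's own code: `decodeRecord` is a hypothetical helper, the 2-bit code (00 = A, 01 = C, 10 = G, 11 = T) is read off the T A C G example, and rounding the k-mer length up to ceil(k/4) bytes is an assumption for values of k that are not multiples of 4. The sketch handles only the 1-byte counter, because the layout of the 5-byte counter is not spelled out in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: decode one record (counter byte + packed k-mer)
// starting at `offset` in a buffer holding raw Gerbil output.
// Assumption: 2 bits per base, most significant bits first,
// 00 = A, 01 = C, 10 = G, 11 = T, and ceil(k/4) bytes per k-mer.
std::pair<std::uint32_t, std::string>
decodeRecord(const std::vector<std::uint8_t>& buf, std::size_t offset, unsigned k) {
    assert(k > 0);
    static const char bases[4] = {'A', 'C', 'G', 'T'};
    const std::size_t kmerBytes = (k + 3) / 4;
    if (offset + 1 + kmerBytes > buf.size()) {
        throw std::out_of_range("truncated record");
    }
    const std::uint8_t counter = buf[offset];
    if (counter == 255) {
        // 255 marks the longer (5-byte) counter, which this sketch does not decode.
        throw std::runtime_error("5-byte counter not handled in this sketch");
    }
    std::string kmer;
    kmer.reserve(k);
    for (unsigned i = 0; i < k; ++i) {
        const std::uint8_t packed = buf[offset + 1 + i / 4];
        const unsigned shift = 6 - 2 * (i % 4);  // 6, 4, 2, 0 within each byte
        kmer.push_back(bases[(packed >> shift) & 0x3u]);
    }
    return std::make_pair(static_cast<std::uint32_t>(counter), kmer);
}
```

Fed with the first eight bytes of the dump above (0x9F, 0xC6, 0xB7, 0xF3, 0x82, 0x64, 0x0D, 0x0D) and k = 28, this returns the counter 159 and a 28-base k-mer that starts with TACG.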

merbert commented 7 years ago

I thought we had already provided a small C++ example for converting the output to FASTA format. We will check this again and upload it if necessary.

srechner commented 7 years ago

Okay, we added a tool that converts Gerbil's output to FASTA format. You can fetch the update with git pull. The tool is compiled automatically when you re-run cmake and make. However, it may be faster to just compile it manually:

    cd gerbil
    git pull
    cd build/
    g++ -O3 ../src/gerbil/toFasta.cpp -o toFasta

Afterwards, you can convert the Gerbil output to human-readable FASTA format by running

    toFasta <gerbil-output> <k> [<fasta-output>]

saranpons3 commented 7 years ago

Thanks, developers, for your answers. How long does it take to convert Gerbil output to FASTA format?

merbert commented 7 years ago

Well, this depends mainly on the number of k-mers in the output file, but it should be relatively fast. However, there is now a new option "-o fasta" for direct FASTA output.

saranpons3 commented 7 years ago

Hello Gerbil developers, I used the -o option to get direct FASTA output and, as you said, it is relatively fast. When I ran Gerbil on the F Vesca data set, I had the following questions. Could you clarify them?

1) In your paper, you mention that the size of the F Vesca data set is 10.2 GB. But when I downloaded it, I got a total size of 9.5 GB. I downloaded all 11 files (SRR005.fastq, SRR072006.fastq, SRR072007.fastq, SRR072008.fastq, SRR072009.fastq, SRR072010.fastq, SRR072011.fastq, SRR072012.fastq, SRR072013.fastq, SRR072014.fastq, SRR072029.fastq). Why is there a difference?

2) When I run Gerbil on the F Vesca data set with a k-mer size of 28, the total number of distinct k-mers is 342160198 with normalization disabled and 219934955 with normalization enabled. But the table in your paper gives 632436468 distinct k-mers. Why is there a difference?

3) Gerbil does not support k-mer sizes above 136. Is this limit specific to this data set, or does it apply to all data sets? If so, why this limitation?

My system configuration is as follows:

    Hard disk: 1 TB
    RAM: 32 GB
    Processor: Intel® Core™ i7-4770 CPU @ 3.40GHz × 8
    GPU: GeForce GTX TITAN Black/PCIe/SSE2

Please clarify. Thanks in advance.

merbert commented 7 years ago

1. The file size in our paper is given in GB (1 GB = 1000^3 B). In your file system, the size appears to be in GiB (1 GiB = 1024^3 B). Unfortunately, it is rarely specified correctly, even in file systems. 10.2 GB * (1000/1024)^3 --> 9.5 GiB

2. Could you please give us the console output? Did you use the GPU? If so, you have to add the CPU and GPU values. Example output with GPU:

    ukmers (CPU) : 198266373
    ukmers (GPU) : 434170095

==>total: 632436468

3. Unfortunately, the program requires a lot of RAM and compile time to support many values of k. We have therefore temporarily limited k to 128+8 by default. (Computers with 8 GB of RAM or less had trouble with k in {8,...,520}.) There is no make option yet, but you can manually set the limit back to 512.

In file include/gerbil/KmerHasher.h, lines 536/537:

    LOOP512(MAX_KMER_SIZE, C_PROC); // remove the "//"
    //LOOP128(MAX_KMER_SIZE, C_PROC); // add "//"

In file include/gerbil/config.h, line 95:

    #define DEF_KMER_RANGE 512 // 128 --> 512

Then recompile.
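The arithmetic behind points 1 and 2 of this reply can be checked with a short C++ sketch. The function names `gbToGib` and `totalDistinctKmers` are made up for illustration and are not part of Gerbil; the numbers in the usage note are the ones quoted in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Convert decimal gigabytes (1 GB = 1000^3 B) to binary gibibytes (1 GiB = 1024^3 B).
double gbToGib(double gigabytes) {
    assert(gigabytes >= 0.0);
    const double bytes = gigabytes * 1000.0 * 1000.0 * 1000.0;
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

// In a GPU run the distinct k-mers are reported separately for CPU and GPU,
// so the overall number is the sum of both values.
std::uint64_t totalDistinctKmers(std::uint64_t cpuKmers, std::uint64_t gpuKmers) {
    return cpuKmers + gpuKmers;
}
```

For example, `gbToGib(10.2)` gives roughly 9.5, matching the size shown by the file system, and `totalDistinctKmers(198266373, 434170095)` gives 632436468, the value from the paper.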

saranpons3 commented 7 years ago

Hello Merbert,

1) Thanks for clarifying my doubt about the file size difference.

2) I have run Gerbil only in CPU mode. This is the command I used to run Gerbil:

    ./gerbil -k 28 -e 10GB -d srr/ temp output1

Console message:

    Thread[0]: read file 'SRR072013.fastq' ( 1 GB)...
    Thread[0]: read file 'SRR072029.fastq' ( 1 GB)...
    Thread[0]: read file 'SRR005.fastq' ( 882 MB)...
    Thread[0]: read file 'SRR072009.fastq' ( 872 MB)...
    Thread[0]: read file 'SRR072011.fastq' ( 836 MB)...
    Thread[0]: read file 'SRR072012.fastq' ( 832 MB)...
    Thread[0]: read file 'SRR072014.fastq' ( 800 MB)...
    Thread[0]: read file 'SRR072010.fastq' ( 799 MB)...
    Thread[0]: read file 'SRR072006.fastq' ( 729 MB)...
    Thread[0]: read file 'SRR072007.fastq' ( 680 MB)...
    Thread[0]: read file 'SRR072008.fastq' ( 425 MB)...

This is the command I used to convert the output to FASTA format:

    ./toFasta output1 28 kmers_out

Console message:

    input file : 'output1' (2739490864 B)
    output file : 'kmers_out'
    k : 28
    start converting to FASTA...
    bytes read : 2739490864 B
    bytes written : 11037487057 B

Could you tell me why the number of distinct k-mers differs from the one in your paper?

3) After making the changes you described, Gerbil now works for k-mer sizes above 136.

Thanks.

saranpons3 commented 7 years ago

Hello Gerbil developers, according to the table in your paper, the largest data set you have tried with Gerbil is H Sapiens 2 (339.5 GB). Have you tried any data set larger than H Sapiens 2? I would like to know how Gerbil performs when the data set is larger than 1 TB. Thanks in advance.

merbert commented 7 years ago

Can you please run gerbil with the -i option for debug output? I will take care of it as soon as possible.

In general, there should be no problems with input data of more than 1 TB, but we have not yet studied such data extensively (if I remember correctly, only up to about 800 GB). However, the performance depends mainly on the following factors:

- k
- number of distinct k-mers (depends mainly on the genome size and the error rate)
- number of k-mers (scales with input size, but is not so important in Gerbil with respect to the required main memory)

For large data sets, the memory limit is generally the deciding factor. Gerbil, however, should be able to deal with this problem quite well. Furthermore, the last update should have extended the maximum input size for a high-performance run.

saranpons3 commented 7 years ago

Dear Merbert, do k-mers that occur only once serve any purpose? Should a k-mer counting tool write k-mers that occur only once to the output file?

saranpons3 commented 7 years ago

Dear Merbert, I got the following error message when I set k = 1000 for the data set rel3-nanopore-wgs-288418386-FAB39088.fastq.gz from https://github.com/nanopore-wgs-consortium/NA12878:

    size of K of kmers (1000) is too large.should be K<=520)

I tried such a large value for k because the sequencing reads in this data set are longer than 2000 bp; they were generated by a third-generation sequencing machine (nanopore sequencing). Does Gerbil currently not support k values above 512? For the longer reads of third-generation sequencing, doesn't k need to be larger, that is, above 512? Please give me your input.

merbert commented 6 years ago

Hello saranpons3,

> Do k-mers that occur only once serve any purpose?

Well, if your input data comes from a sequencing machine, the reads also contain erroneous bases. Every faulty base generates up to k faulty k-mers. A k-mer produced this way rarely occurs more than once or twice, so such k-mers are usually ignored.
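To make the effect of a single faulty base concrete, here is a small C++ sketch that counts how many k-mers of a read contain a given position. The helper `kmersCoveringPosition` is hypothetical and not part of Gerbil; it assumes 0-based positions and that every window of length k in the read is extracted as a k-mer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of k-mers of a read of length readLen that contain position pos.
// A k-mer starting at s covers pos if s <= pos <= s + k - 1, and valid
// start positions are 0 .. readLen - k.
std::size_t kmersCoveringPosition(std::size_t readLen, std::size_t k, std::size_t pos) {
    assert(k > 0 && pos < readLen);
    if (k > readLen) {
        return 0;  // the read is too short to contain any k-mer
    }
    const std::size_t lastStart = readLen - k;
    const std::size_t lo = (pos + 1 > k) ? pos + 1 - k : 0;  // first start covering pos
    const std::size_t hi = std::min(pos, lastStart);         // last start covering pos
    return hi - lo + 1;
}
```

For a read of length 100 and k = 28, an error in the middle of the read (position 50) affects 28 k-mers, while an error at the very first or last base affects only one.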

> Should a k-mer counting tool write k-mers that occur only once to the output file?

Set the option -l 1 to output all k-mers. (The number is the minimum number of occurrences a k-mer needs in order to appear in the output.)

> Does Gerbil currently not support k value above 512?

Not natively, but it could be added with a few lines of code. Is there a meaningful application for this? A large k leads to an incredibly high number of faulty k-mers, and it would be pretty hard to distinguish between the faulty and the true k-mers, even with high-quality third-generation sequencing reads.