uni-halle / gerbil

A fast and memory-efficient k-mer counter with GPU-support
MIT License

Support of compressed files #25

Open Paul-bunel opened 12 months ago

Paul-bunel commented 12 months ago

Hello, I have the same problem as described here: https://github.com/uni-halle/gerbil/issues/22

I tried to run Gerbil on several compressed fastq files, and it consistently fails with the following error (sometimes reported for thread[1], sometimes for thread[0]):

 〉gerbil -k 28 -l 2 -t 2 -i -e 4G fv.txt tmp_bins tmp_gres
______________________________________________
Gerbil version 1.12
================= PARAMETERS =================
size of k-mers          :    28
size of minimizers      :     7
threshold min           :     2
normalized kmers        :  true
number of temp-files    :   512
total number of threads :     4
number of splitters     :     2
number of hashers       :     2
input                   :       fv.txt
temp                    :       tmp_bins
output                  :       tmp_gres
size of memory          :  4096 MB
number of gpu's         :     0
---------------------------------------------
Thread[0]: read file 'SRR072013.fastq.gz' ( 587 MB)...
Thread[1]: read file 'SRR072029.fastq.gz' ( 553 MB)...
ERROR: unexpected end of stream (thread[1])! Archive corrupt?!

Any ideas?

Just in case it helps, this is what the cmake command printed when I installed the tool (I have CMake 3.22 and no CUDA):

CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required):
  Compatibility with CMake < 2.8.12 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.74.0/BoostConfig.cmake (found version "1.74.0") found components: system thread filesystem regex 
-- Found BZip2: /usr/lib/x86_64-linux-gnu/libbz2.so (found version "1.0.8") 
-- Looking for BZ2_bzCompressInit
-- Looking for BZ2_bzCompressInit - found
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.11") 
-- Found Threads: TRUE  
CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) 
-- Build type: Release
-- Configuring done
-- Generating done
-- Build files have been written to: /home/paul/Travail/gerbil/build

make output:

[  5%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/Application.cpp.o
[ 11%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/Bundle.cpp.o
[ 17%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/FastFile.cpp.o
[ 23%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/FastParser.cpp.o
[ 29%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/FastReader.cpp.o
[ 35%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/KmcWriter.cpp.o
[ 41%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/KmerDistributor.cpp.o
[ 47%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/SequenceSplitter.cpp.o
[ 52%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/SuperReader.cpp.o
[ 58%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/SuperWriter.cpp.o
[ 64%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/TempFile.cpp.o
[ 70%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/debug.cpp.o
[ 76%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/gerbil.cpp.o
[ 82%] Building CXX object CMakeFiles/gerbil.dir/src/gerbil/global.cpp.o
[ 88%] Linking CXX executable gerbil
[ 88%] Built target gerbil
[ 94%] Building CXX object CMakeFiles/toFasta.dir/src/gerbil/toFasta.cpp.o
[100%] Linking CXX executable toFasta
[100%] Built target toFasta
Paul-bunel commented 12 months ago

UPDATE:

The problem may be specific to these gz files, since Gerbil worked with a fastq.gz file from another source. What's strange is that the tool kmc works fine with the original gz files, while gerbil doesn't.

Paul-bunel commented 11 months ago

FINAL UPDATE:

It seems the problem comes from the fact that I used the tool parallel-fastq-dump to retrieve the fastq.gz files. The files themselves are valid: they decompress to exactly the same content as files retrieved by other methods (such as the tool fastq-dump or a direct download from the website), and kmc works well with them. However, gerbil does not seem to be able to handle them.

What I did was decompress the "corrupted" fastq.gz files and then re-compress them with gzip, and now gerbil works.
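
For anyone hitting the same error, here is a minimal Python sketch of that workaround. It rests on an assumption that this thread does not confirm: that the affected files consist of several gzip streams ("members") concatenated into one file, which is valid gzip but is not handled by every reader. The helper names `count_gzip_members` and `recompress_single_member` are hypothetical and made up for illustration; only the standard-library `gzip`, `zlib`, and `shutil` modules are used.

```python
import gzip
import shutil
import zlib


def count_gzip_members(path, chunk_size=1 << 20):
    """Count the complete gzip members (concatenated streams) in a file."""
    members = 0
    decomp = None
    buf = b""
    with open(path, "rb") as fh:
        while True:
            if not buf:
                buf = fh.read(chunk_size)
                if not buf:
                    break  # end of file
            if decomp is None:
                # wbits=31 tells zlib to expect a gzip header and trailer.
                decomp = zlib.decompressobj(wbits=31)
            decomp.decompress(buf)  # decompressed bytes are discarded
            if decomp.eof:
                # One member ended; leftover bytes belong to the next one.
                members += 1
                buf = decomp.unused_data
                decomp = None
            else:
                buf = b""
    return members


def recompress_single_member(src, dst):
    """Decompress src (single- or multi-member) and rewrite it as one member."""
    # gzip.open reads across member boundaries transparently when reading,
    # and writes a single gzip member when writing.
    with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
```

If `count_gzip_members` reports more than one member for a file that gerbil rejects, that would be consistent with the multi-member explanation, but it does not prove it. Either way, `recompress_single_member` does the same thing as decompressing and re-running gzip by hand.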