smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

Segmentation faults upon writing output #45

Open kmshort opened 1 year ago

kmshort commented 1 year ago

Hi, I've compiled Falco.

configure:

 ./configure CXXFLAGS="-O3 -Wall"
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a race-free mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking whether make supports nested variables... (cached) yes
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C++... yes
checking whether g++ accepts -g... yes
checking for g++ option to enable C++11 features... none needed
checking whether make supports the include directive... yes (GNU style)
checking dependency style of g++... gcc3
checking whether g++ supports C++11 features with -std=c++11... yes
checking for g++ -std=c++11 option to support OpenMP... -fopenmp
checking for zlibVersion in -lz... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: executing depfiles commands

make:

make all
make  all-am
make[1]: Entering directory '/home/blahuser/progs/falco-1.2.1'
  CXX      src/falco-falco.o
  CXX      src/falco-FastqStats.o
  CXX      src/falco-HtmlMaker.o
  CXX      src/falco-Module.o
  CXX      src/falco-StreamReader.o
  CXX      src/falco-FalcoConfig.o
  CXX      src/falco-OptionParser.o
  CXX      src/falco-smithlab_utils.o
  CXXLD    falco
make[1]: Leaving directory '/home/blahuser/progs/falco-1.2.1'

install:

sudo make install
make[1]: Entering directory '/home/blahuser/progs/falco-1.2.1'
 /usr/bin/mkdir -p '/usr/local/bin'
  /usr/bin/install -c falco '/usr/local/bin'
make[1]: Nothing to be done for 'install-data-am'.
make[1]: Leaving directory '/home/blahuser/progs/falco-1.2.1'

and run falco: falco sequencing.fq.gz

and get output:

[limits]        using file /home/blahuser/progs/falco-1.2.1/Configuration/limits.txt
[adapters]      using file /home/blahuser/progs/falco-1.2.1/Configuration/adapter_list.txt
[contaminants]  using file /home/blahuser/progs/falco-1.2.1/Configuration/contaminant_list.txt
[Mon May  8 14:33:37 2023] Started reading file sequencing.fq.gz
[Mon May  8 14:33:37 2023] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Mon May  8 14:42:22 2023] Finished reading file
[Mon May  8 14:42:22 2023] Writing summary to ./summary.txt
[Mon May  8 14:42:22 2023] Writing text report to ./fastqc_data.txt
[Mon May  8 14:42:22 2023] Writing HTML report to ./fastqc_report.html
Segmentation fault

I have paired-end sequences that have gone through Trim Galore!

I've tested on the R1, and falco runs fine (it's sooooo much faster than FastQC, it's amazing).

But falco crashes with a segfault on the R2 sequence. The file is a 15302780411-byte (~15.3 GB) gzipped FASTQ file.

The head of the original file started something like this (I actually passed a modified version of it that had gone through Trim Galore):

@V350096722L1C001R00100001050
GTTCGAACTAATTTCCAAAACGAATATACAAACTTACAATCGCACCAACAATAAAAAAAAATTCCTCTTTCTCCACATCCACACCAACATCTACTATCAC
+
HA=HH;C?BED@;BF9EFFCBGE8AECEEEED/</FGEDBEH7E7BFCEFC7DFEECEC'E.<8D:C=3=@3F1EAD0FD/GDFDFE4E,BCFFD@CGFF
@V350096722L1C001R00100001075
GCGACACTATCAAAACACTACACCCACCTCAATTTACCCAAACTCTACCACCCTTTTTAAAAAAAAAAAAAAACCCCTCTTATCCTAAACTATCTCTCAA
+
G?FBGDDCCBBEEBEADBCCFE792CD<DCCEC;BE:B>EBEBA<:CBDBD9BB@?B@CEEEEEECBCCECBE:+=C61C=EB=AAC@B98E,A:(C5>#
@V350096722L1C001R00100001079
TCGACTACTACAAACCTATCTCCCAACTCCACACTACCTACCTCTACTACACAAAACCCACAAATCAAAAAAACACACAACTAAACACCAAACACGTGTA
+
@ECC5;EDBE=EDBDCCE?DDDCD6FE:8@E*C@'EBD9E7=A7BCADE6F:AC9D8:CDDEDBB=EEEEDDE<C7D>C(1+C?C+/EDCE7*E,2CB:E
@V350096722L1C001R00100001117
CGAATACTTCACTAACTCCAAACAACTCGAAACCAACCTTACCAAACTTACTAAAACGAAATAACGTATTACCCTCTCTAATATTCACTTTCCGAAATCA
+
FIFDDDCCCFDFDDFDCDHDFDHDFDDHIFFEGDDFHGDCDGIFEDEDDFECFFFFGIEFFCDFGHCFCDDGGGCDDHDFEBFCDHDH;DAHHHAEFDGE
@V350096722L1C001R00100001129
GCGAAAAAAAATAAAACCAATCTCATTAATCATTATCATAACTATAAAACAACAAAAAACGAAAATAAAAAAAACACACAACAAAACTCCAATCACGTGT
+
CD:=CCEDE8F-?FCD($ED3EB;E7BE=4AD,<F8ED3B@C@C4FECEFDFAFEAA>60;FC?D$7DECEDDD>B=D9E:<$EDEB3G2B?D&E(;DG%
@V350096722L1C001R00100001130
CGAACACAACCAACCATCTTCAAAAAATCACCACCCTTCACACACACAAACATCAATACACAACAACTCACCACACCTCACAATCCACACACCCCAAACA
+
EFBECCBCDCBEBDACCADBBCBDBCD8BE>:BA.A?B>E@CBEBEEEAEDABBE?BCDE=DCBEEB@=E@3B<E.%?DDFB&??AA5EEBD9@ABEC@E
@V350096722L1C001R00100001146
CGAAACCCGAACCCCCACGAACCGACGACTCTTACCGCCTAATCACCCACCAACAACCAACGATCAACAACAAACGACAAACAACAAACACCACTAAATC
+
=@<EE>B8H2CC?.59CBG$=?8??>GDA@?BAAC>H(2.BA?8A;45E5DEA<?@='CEA7<*)CE&@E8=C?&;C;DCE<6C9CCC.E4;D23ECB1@
@V350096722L1C001R00100001155
CGACCCTACATAATAATTTTAATAATTTAAAAAACGAAACAATTCCGCGATATAAAATTTTCTACTCTAAAACGACATCGAAATTTACAACCGAAAAATC
+
FDDFIFCDFFCDECFDCCCDDFCDECDBDFFFCDDHEEECFEDDG@HEIFDFDDEBFBBCDECDGCGDEEEFFHEED3=HC#D@DCDGDEDBHDEEECCG
@V350096722L1C001R00100001162
CGACCAACAAACAACACACACACCCACACAACTCTAAACACCCCAAACCTTAACACCAAACCTCTCAACCCTAACACCATAACTTAACCCTAACCACAAA
+
FFDDGCCDEECDCECDEEFEEDBEFDEECDEEBBDECEFEDDGGCDDFFDC?CECE@DDEFEBF<CEE@EDABCDC8FD<$E(@CEEEFA<EAA5CD2DC
@V350096722L1C001R00100001181
CGACTTCTACCTAAATAAAACATCCAAAAATTAAATTATATTTTATAAAACTAATACCACCAAAACAAAAAAACACACATCTAAACTCCAATCACGTGTA
+
GHEEDCGCDFCDDFFCDFFDGD,GGDDEDDDDDDFC3DCDDCABDCDEEEGDDDBDGFFFFFCC<GDDFDDDC9DFDFF6C=DEDFA=GDF6GDFHBGCF

It has come from an MGI instrument, but it's nothing special. Falco is happy when I pass my R1 to it.

Any ideas why this would seg fault? All three outputs (summary, txt and html) are empty files when it faults with R2. When falco processes the R1 sequence, it's fine and the output looks good.

I'm running Ubuntu 20.04, if that matters.

many thanks, Kieran

andrewdavidsmith commented 1 year ago

@kmshort I'm not able to reproduce the problem using the part of the R2 file you pasted above. It's possible someone else might be able to figure out the problem from the info you've already provided, but I don't think I can help without more. One thing I would suggest: try to cut the file in half (wc -l to find the number of lines, then zcat and head/tail to get the halves, making sure to keep the number of lines a multiple of 4). Repeat this a few times to narrow things down to a small example that causes the error. If you can do that, and you don't mind sending me the data, I can try to work with it.
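
A minimal sketch of one halving step, assuming the problematic file is named R2.fastq.gz (adjust the name to your data):

 nlines=$(zcat R2.fastq.gz | wc -l)                 # total line count (4 lines per read)
 half=$(( (nlines / 8) * 4 ))                       # half the reads, kept as a multiple of 4 lines
 zcat R2.fastq.gz | head -n "$half" | gzip > R2.half1.fastq.gz
 zcat R2.fastq.gz | tail -n +"$(( half + 1 ))" | gzip > R2.half2.fastq.gz
 falco R2.half1.fastq.gz                            # rerun on each half; keep whichever still segfaults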

Also, I should mention that I tested a fresh clone, so I'll go back when I can and try with the tar of v1.2.1 and see if it produces the error on this small fragment of the data.

kmshort commented 1 year ago

Thanks @andrewdavidsmith. I started doing what you suggested, but at the same time I ran falco on the entire dataset again after a system restart. It worked without segfaulting! So that's good, but it raises questions in and of itself. For now, though, nothing to see here.

If anyone else has similar problems, I guess the first thing to do is "turn it off and on again". You'll be sure to hear if it starts happening again.

kmshort commented 1 year ago

Oh dear, it's seg faulting again. I'll see if I can divide and conquer to find the offending sequence(s), or other.

yangli04 commented 8 months ago

I had the same problem. The fastq file I used is SRR14562354.

andrewdavidsmith commented 8 months ago

@yangli04 I need version info, your environment, and preferably a link to part of the fastq file. You gave the SRA run accession, but that's not always enough (e.g., which version of fastq-dump; whether you used wget first; fasterq-dump, etc.). If you can use head from the terminal to get a file that reproduces the problem, I can work directly with that. If you can't give me a small test file, give me your command line and a hash of the input files (md5) so we can try to reproduce it.
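
For example, something along these lines (a sketch only; the filename is a guess at what fastq-dump produced):

 zcat SRR14562354_2.fastq.gz | head -n 400000 | gzip > head_sample.fastq.gz   # small test file of 100k reads
 falco head_sample.fastq.gz                                                   # does the small file still crash?
 md5sum SRR14562354_2.fastq.gz                                                # hash of the full input, to share with your command line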

yangli04 commented 8 months ago

First, if I take only the head of the file, I cannot reproduce the problem. I can only reproduce it with the whole .fastq.gz file created directly by fastq-dump.

I used falco 1.2.1 and fastq-dump 3.1.0.

Second, I think it might be a problem with the compression. When I gunzip the .fastq.gz file, running falco on the decompressed .fastq file does not cause any problem. Even if I use gzip to create a .fastq.gz file again from the unzipped .fastq file, that does not cause the problem either.

Then I computed the md5sum of the two compressed files:

  1. The md5 of the .fastq.gz file created directly from the .sra file by fastq-dump --split-3 --gzip SRR14562354.sra: 5b5c346a3897212d216b44cc8578536a
  2. The md5 of the .fastq.gz file obtained by gunzipping the file above and then gzipping it again: d2e704fbf40abac121762ef2af506e81

It seems like the files are different.

The md5 of my .sra file is ddd71a585d80515e4766f676dc7c0be1 (SRR14562354.sra).
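
For reference, a sketch of the test described above (the _2 filename is assumed from the --split-3 output and may differ on your setup):

 fastq-dump --split-3 --gzip SRR14562354.sra                        # the resulting .fastq.gz crashes falco
 gunzip -k SRR14562354_2.fastq.gz                                   # keep the original, write SRR14562354_2.fastq
 falco SRR14562354_2.fastq                                          # uncompressed input: no crash
 gzip -c SRR14562354_2.fastq > SRR14562354_2.regz.fastq.gz
 falco SRR14562354_2.regz.fastq.gz                                  # recompressed with plain gzip: no crash
 md5sum SRR14562354_2.fastq.gz SRR14562354_2.regz.fastq.gz          # the two .gz files have different md5s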

andrewdavidsmith commented 8 months ago

@yangli04 I will be able to test this with your info, but it might take some time. There's a chance the issue has already been fixed in v1.2.2, because between v1.2.1 and v1.2.2 we made updates related to the compression library as part of faster processing of the BAM format. So if you can tell me whether the problem is still present in v1.2.2, it might make things happen faster. That will be the first step for me in debugging when I have time for it.

yangli04 commented 8 months ago

@andrewdavidsmith Thank you. This problem was magically solved in v1.2.2.