smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0
98 stars 11 forks source link

dont fail on empty file #74

Closed pdimens closed 2 weeks ago

pdimens commented 1 month ago

As part of a larger workflow, if falco is run on an empty fastq(.gz) file, it fails with an error code. Is it possible, either via a flag or by default behavior, to instead have falco output a report file that is empty? Example:

##Falco 1.2.4
>>Basic Statistics  fail
#Measure    Value
Filename    FILENAME.fq.gz
File type   Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality   0
Sequence length 0
%GC 0
>>END_MODULE
andrewdavidsmith commented 1 month ago

@pdimens I'd have to think about the chances of confusing some users or masking problems in other parts of people's workflows. I personally like the binaries I use in my workflows to be conservative, and clean things up myself, rather than risk hiding other issues in other parts of the workflow. I'm open to making changes, but I'd need more details. Is it the exit code of non-zero that causes a failure of a workflow step for you, or is it the absence of an output file?

pdimens commented 1 month ago

It's technically both that are creating pain points. My current solution is to mimic a try/catch block in the shell:

( falco --quiet --threads 1 -skip-report -skip-summary -data-filename OUTFILE INFILE ) ||
cat <<EOF > {output}
##Falco 1.2.4
>>Basic Statistics  fail
#Measure    Value
Filename    $FILENAME.fq.gz
File type   Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences 0
Sequences flagged as poor quality   0
Sequence length 0
%GC 0
>>END_MODULE
EOF  

I understand what you mean about conservative behavior, which would be maintained with the addition of some kind of --allow-empty flag that would both exit gracefully and create a basic output file like the one above (that would be properly formatted for MultiQC).

andrewdavidsmith commented 1 month ago

I think that should be fine.

andrewdavidsmith commented 1 month ago

@pdimens I made a branch with this feature. Most of the work was related to fixing potential divide-by-zero when generating summary files if the inputs were empty. The current behavior wasn't intended, as it would just crash. In the new branch add-option-for-empty-input the default behavior checks all inputs to ensure they are non-empty, and if the -allow-empty-input is specified, it will run anyway, and generate outputs that have counts of 0 inside them. The output files themselves are not empty, but the information in them is. Let me know what you think.

pdimens commented 1 month ago

Thanks for looking into this. I cloned the repo but the branch doesn't have the configure file like the README specifies for manual compilation. It has a configre.ac file though, however I am unfamiliar with how to use it.

andrewdavidsmith commented 1 month ago

If there's a ./autoconf.sh it will generate the configure. Else I'll add one.

pdimens commented 1 month ago

Whether using the new flag or not, the program finishes successfully, which I expect is unintended behavior.

andrewdavidsmith commented 1 month ago

@pdimens I'll just paste my quick test which might be naive. Can you check that you get the same thing?

(base) [17:36][junior3 smithlabcode]$ git clone --recursive https://github.com/smithlabcode/falco.git
Cloning into 'falco'...
remote: Enumerating objects: 1351, done.
remote: Counting objects: 100% (310/310), done.
remote: Compressing objects: 100% (109/109), done.
remote: Total 1351 (delta 217), reused 248 (delta 192), pack-reused 1041 (from 1)
Receiving objects: 100% (1351/1351), 2.36 MiB | 1.50 MiB/s, done.
Resolving deltas: 100% (925/925), done.
(base) [17:36][junior3 smithlabcode]$ cd falco
(base) [17:36][junior3 falco (master)]$ git checkout add-option-for-empty-input 
Branch 'add-option-for-empty-input' set up to track remote branch 'add-option-for-empty-input' from 'origin'.
Switched to a new branch 'add-option-for-empty-input'
(base) [17:36][junior3 falco (add-option-for..)]$ ./autogen.sh 
configure.ac:22: installing './install-sh'
configure.ac:22: installing './missing'
Makefile.am: installing './depcomp'
parallel-tests: installing './test-driver'
(base) [17:36][junior3 falco (add-option-for..)]$ mkdir build && cd build
(base) [17:36][junior3 build (add-option-for..)]$ ../configure --enable-hts --enable-silent-rules
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a race-free mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking whether make supports nested variables... yes
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether the compiler supports GNU C++... yes
checking whether g++ accepts -g... yes
checking for g++ option to enable C++11 features... none needed
checking whether make supports the include directive... yes (GNU style)
checking dependency style of g++... gcc3
checking whether g++ supports C++17 features with -std=c++17... yes
checking for g++ -std=c++17 option to support OpenMP... -fopenmp
checking for zlibVersion in -lz... yes
checking for hts_version in -lhts... yes
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating Makefile
config.status: creating config.h
config.status: linking ../test/md5sum.txt to test_build/md5sum.txt
config.status: linking ../test/test_data.tgz to test_build/test_data.tgz
config.status: executing depfiles commands
(base) [17:36][junior3 build (add-option-for..)]$ make -j8
make  all-am
make[1]: Entering directory '/home/andrewds/code/smithlabcode/falco/build'
  CXX      src/falco-falco.o
  CXX      src/falco-FastqStats.o
  CXX      src/falco-HtmlMaker.o
  CXX      src/falco-Module.o
  CXX      src/falco-StreamReader.o
  CXX      src/falco-FalcoConfig.o
  CXX      src/falco-OptionParser.o
  CXX      src/falco-smithlab_utils.o
  CXXLD    falco
make[1]: Leaving directory '/home/andrewds/code/smithlabcode/falco/build'
(base) [17:36][junior3 build (add-option-for..)]$ make check
make  check-TESTS
make[1]: Entering directory '/home/andrewds/code/smithlabcode/falco/build'
make[2]: Entering directory '/home/andrewds/code/smithlabcode/falco/build'
PASS: test/falco.test
============================================================================
Testsuite summary for falco 1.2.4
============================================================================
# TOTAL: 1
# PASS:  1
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0
============================================================================
make[2]: Leaving directory '/home/andrewds/code/smithlabcode/falco/build'
make[1]: Leaving directory '/home/andrewds/code/smithlabcode/falco/build'
(base) [17:36][junior3 build (add-option-for..)]$ touch empty_file.fq
(base) [17:36][junior3 build (add-option-for..)]$ ./falco empty_file.fq 
Input file is empty: empty_file.fq
(base) [17:36][junior3 build (add-option-for..)]$ ./falco -allow-empty-input empty_file.fq 
[limits]    using file /home/andrewds/code/smithlabcode/falco/Configuration/limits.txt
[adapters]  using file /home/andrewds/code/smithlabcode/falco/Configuration/adapter_list.txt
[contaminants]  using file /home/andrewds/code/smithlabcode/falco/Configuration/contaminant_list.txt
[Thu Oct 24 17:37:01 2024] Started reading file empty_file.fq
[Thu Oct 24 17:37:01 2024] reading file as uncompressed FASTQ format
[running falco|===================================================|9223372036854775808%]
[Thu Oct 24 17:37:01 2024] Finished reading file
[Thu Oct 24 17:37:01 2024] Writing summary to ./summary.txt
[Thu Oct 24 17:37:01 2024] Writing text report to ./fastqc_data.txt
[Thu Oct 24 17:37:01 2024] Writing HTML report to ./fastqc_report.html
Elapsed time for file empty_file.fq: 0s
(base) [17:37][junior3 build (add-option-for..)]$ 
pdimens commented 1 month ago

Oh how strange, I ran it again and it worked as described.

pdimens commented 1 week ago

@andrewdavidsmith would it be possible to push this to a new release anytime soon?

andrewdavidsmith commented 1 week ago

Sure. I can probably do it in the next couple days.

pdimens commented 1 week ago

Wonderful, thank you!