smithlabcode / falco

A C++ drop-in replacement for FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

Memory leak or stall? #40

Closed schorlton closed 1 year ago

schorlton commented 1 year ago

Thanks for falco!

Running v1.2.0 installed from bioconda on a nanopore FASTQ

falco --format fastq -skip-report -t 1 -skip-summary nanopore.fastq.gz 
[limits]    using default limit cutoffs (no file specified)
[adapters]  using default adapters (no file specified)
[contaminants]  using default contaminant list (no file specified)
[Thu Sep 15 10:26:59 2022] Started reading file nanopore.fastq.gz
[Thu Sep 15 10:27:00 2022] reading file as gzipped FASTQ format
[running falco|===================================================|100%]
[Thu Sep 15 10:27:13 2022] Finished reading file
[Thu Sep 15 10:27:13 2022] Writing text report to ./fastqc_data.txt

It looks like it generates fastqc_data.txt properly, but after that it consumes over 32 GB of RAM over several minutes until I kill it... Is this expected? For comparison, FastQC processes the file in 18 s using under 1 GB of RAM.

Here are the stats of the file:

seqkit stats nanopore.fastq.gz 
file              format  type  num_seqs      sum_len  min_len  avg_len  max_len
nanopore.fastq.gz  FASTQ   DNA     30,720  204,534,138      100    6,658   35,768
guilhermesena1 commented 1 year ago

Hi

This may be a strange thing to ask, but would you be able to try running the same thing without --format fastq (let falco figure out the format) and let us know if you have the same problem? It may help us zero in on the problem.

thank you!

schorlton commented 1 year ago

Thanks for your quick response! Running without --format fastq hit the same problem. I also tried running on the ungzipped file without --format fastq and that also hit the same issue. Thanks!

guilhermesena1 commented 1 year ago

Hm that's very strange indeed. I had a suspicion that --format fastq was forcing the program to interpret the gzipped file as uncompressed, thus leading to problems.

Would it be too much to ask for you to share either the entire dataset or a small subset that causes the problem on your machine? I'd be very interested in seeing if I can reproduce it too; this may be more complicated than I thought. I understand it's tricky to share these things when the data is private, so no worries if you cannot, or if it's more convenient you can always send it to me at desenabr[at]usc[dot]edu.

Finally, and this is admittedly a hail mary, does the program also fail if you don't use any flags? Just falco nanopore.fastq.gz?

Thank you!

andrewdavidsmith commented 1 year ago

@schorlton If you are only able to share part of the dataset, please include something with the same structure for the read names, unless they are really short and contiguous. Also try to include any special characters that would be in the original file, for example by extracting the smaller part with command line tools. Finally, the most helpful subset would have similar variability in read lengths. Thanks!

schorlton commented 1 year ago

Running falco nanopore.fastq.gz hits the same issue. I've managed to reproduce it on the first random public nanopore sequencing file I pulled from SRA: SRR21529942.fastq.gz. Looking forward to your analysis.

andrewdavidsmith commented 1 year ago

I can report that, using the provided file, I see the same problem.

If @guilhermesena1 notices the same, we will need to find the problem and update conda. In the meantime, @schorlton if you are in an environment where you can build falco from source, maybe you could try that using the source here. If you want to do that and encounter problems we can try to help. Otherwise we hope you'll wait for the fix on conda, which we will try to get done asap.

schorlton commented 1 year ago

Glad I'm not going crazy :stuck_out_tongue_closed_eyes: Thanks for your investigation - I'll stay tuned for a conda fix as I can use FastQC in the interim. -Sam

andrewdavidsmith commented 1 year ago

Problem identified. Hopefully we can have a fix soon on conda. If you need it before we can do that, you'll have to try building from source -- preferably the release I linked above.

guilhermesena1 commented 1 year ago

The problem can also be reproduced on a clone/release if we delete/rename the files in the Configuration directory, so it may have something to do with the defaults.

@schorlton if you need a quick fix, you can point falco to the configuration files shipped with conda as follows:

$ falco -c /path/to/your/conda/opt/falco-1.2.0/Configuration/contaminant_list.txt \
        -a /path/to/your/conda/opt/falco-1.2.0/Configuration/adapter_list.txt \
        -l /path/to/your/conda/opt/falco-1.2.0/Configuration/limits.txt \
        [the rest of your parameters here]

I think when falco is compiled internally through conda (since they only keep the binary), the binary still isn't finding the Configuration folder under /opt. The way fastqc does it is by compiling everything at the conda root directory, moving everything to opt, then making a symlink under the bin directory that points to the binary under opt. I didn't like that approach very much and was trying to avoid it, but I guess if we want to emulate their behavior, we will emulate their behavior. We'll try to fix this, but it may take a bit longer since it's done through a pull request.
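For illustration, here is a minimal C++ sketch of the kind of lookup described above (an assumed approach, not falco's actual code; the paths and file names are only examples): resolve the real location of the binary, then look for a Configuration directory next to it, so that a symlink in bin/ pointing into opt/ still finds the config files.

// Hypothetical sketch, not falco's implementation: locate Configuration
// relative to the resolved binary so a bin/ symlink into opt/ still finds
// opt/falco-x.y.z/Configuration.
#include <filesystem>
#include <iostream>

int main() {
  namespace fs = std::filesystem;
  // Linux-specific: /proc/self/exe points at the real executable even
  // when the program was started through a symlink in bin/.
  const fs::path exe = fs::canonical("/proc/self/exe");
  const fs::path config = exe.parent_path() / "Configuration";
  if (fs::exists(config / "limits.txt"))
    std::cout << "using " << (config / "limits.txt").string() << '\n';
  else
    std::cout << "falling back to built-in defaults\n";
  return 0;
}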

guilhermesena1 commented 1 year ago

I think I found the source of the problem and pushed a possible fix at bca5f11.

When the default values were used, the shortest adapter size wasn't being set properly, and subtracting 1 from it led to an underflow. You should also be able to run the code just by using the -a flag in the command above. In any case we'll roll this into a new release asap and update it on conda.
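To make the failure mode concrete, here is a minimal C++ illustration of the unsigned underflow described above (variable names are hypothetical; this is not the actual falco code): if the shortest adapter length is left at 0 in an unsigned type, subtracting 1 wraps around to the largest representable value, and anything bounded by that value will run until memory is exhausted.

// Hypothetical illustration of the unsigned underflow, not falco's code.
#include <cstddef>
#include <iostream>

int main() {
  std::size_t shortest_adapter = 0;          // default was never set properly
  std::size_t limit = shortest_adapter - 1;  // wraps around to SIZE_MAX
  std::cout << "limit = " << limit << '\n';  // 18446744073709551615 on 64-bit
  // Any loop or allocation bounded by `limit` would then effectively run
  // forever, consuming memory until the process is killed.
  return 0;
}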

Sorry for the inconvenience and thank you again for sharing the issue with us!

schorlton commented 1 year ago

Thanks for the quick fix! Will test out the conda update once released.

schorlton commented 1 year ago

Unfortunately now getting a new error on 1.2.1 from bioconda:

falco -skip-summary -skip-report nanopore.fastq.gz 
[limits]    using file /opt/conda/opt/falco-1.2.1/Configuration/limits.txt
instruction for limit duplication not found in file /opt/conda/opt/falco-1.2.1
guilhermesena1 commented 1 year ago

Yep, trying to figure this one out. What's weird is that it is reading the correct file, and what's more, if you do

falco -l /opt/conda/opt/falco-1.2.1/Configuration/limits.txt [etc etc]

it runs ok, but not if you don't specify the file, even if the path it reports is correct.

guilhermesena1 commented 1 year ago

Alright, I know better than to get my hopes up, but we made some changes in the most recent release and merged them into conda to address the issue.

As far as I can tell, a conda download of falco puts the config files at opt/falco/Configuration and the falco binary in conda correctly uses these files to process the data. @schorlton if you are still having issues and are still interested in testing the most recent changes, we would love to know if things are running correctly on your end too. Just bear in mind you may have to run conda remove falco prior to updating.

schorlton commented 1 year ago

It works - thank you! The only remaining issue is with specifying the file format: passing --format fastq when the file is gzipped FASTQ causes a crash, while omitting the flag works.

guilhermesena1 commented 1 year ago

We will create another issue to address this. Ideally fq and fq.gz should be treated the same, but the reading process is a bit different for each, and technically passing --format fastq makes the output report that the file is being read as uncompressed, so the failure makes sense since the input is then effectively malformatted. Regardless, this does warrant clarification.
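One common way to treat fq and fq.gz uniformly, regardless of any user-supplied flag, is to sniff the two-byte gzip magic number before choosing a reader. A minimal C++ sketch of that idea (an assumed approach for illustration, not necessarily what falco does):

// Assumed approach, not necessarily falco's reader: check the gzip magic
// bytes (0x1f 0x8b) to decide between plain and gzipped FASTQ instead of
// trusting the --format flag alone.
#include <cstdio>

static bool is_gzipped(const char *filename) {
  std::FILE *in = std::fopen(filename, "rb");
  if (!in) return false;
  unsigned char magic[2] = {0, 0};
  const bool ok = std::fread(magic, 1, sizeof magic, in) == sizeof magic;
  std::fclose(in);
  return ok && magic[0] == 0x1f && magic[1] == 0x8b;
}

int main(int argc, char **argv) {
  if (argc < 2) return 1;
  std::printf("%s: %s\n", argv[1], is_gzipped(argv[1]) ? "gzipped" : "plain FASTQ");
  return 0;
}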

I'll close this issue since the initial problem has been resolved, but any further problems regarding the stalling can always be reported by reopening it.