marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Cutadapt 4.7 hangs when using process substitution #772

Closed Redmar-van-den-Berg closed 5 months ago

Redmar-van-den-Berg commented 6 months ago

I'm using process substitution to merge and trim multiple FastQ files in a single step with cutadapt. However, cutadapt 4.7 hangs, while 4.6 does not.

cutadapt <(pigz --decompress --stdout R1.fastq.gz) <(pigz --decompress --stdout R2.fastq.gz)

Cutadapt: docker://quay.io/biocontainers/cutadapt:4.7--py310h4b81fae_1

Cutadapt does write the start of a log file:

This is cutadapt 4.7 with Python 3.10.13
Command line parameters: -a AGATCGGAAGAG -A AGATCGGAAGAG --cores=8 --compression-level=1 --minimum-length=20 --quality-cutoff=20,20 --output=TestSample3/qc-seq/TestSample3-R1.fq.gz --paired-output=TestSample3/qc-seq/TestSample3-R2.fq.gz --json=TestSample3/qc-seq/TestSample3.cutadapt.json /dev/fd/63 /dev/fd/62

This issue does not occur with docker://quay.io/biocontainers/cutadapt:4.6--py39hf95cd2a_1

marcelm commented 6 months ago

I can only look at this more closely next week, but in the meantime, my recommendation is to use the file names directly. Cutadapt goes to a lot of effort to make decompression of input files and compression of output files efficient. Overriding how Cutadapt does this is likely counterproductive (cf. https://github.com/pycompression/xopen/#backends).

Jerrythafast commented 6 months ago

We are running into the same issue here. In our case, the data comes from a giant zip file on a network drive, so we aim to avoid writing intermediate files by having cutadapt read directly from the unzip command:

cutadapt --interleaved \
  <(unzip -p /path/to/giant.zip internal/path/to/data_R1_001.fastq.gz) \
  <(unzip -p /path/to/giant.zip internal/path/to/data_R2_001.fastq.gz) | \
next-command ...

I doubt this is a use case that cutadapt supports natively (I would be seriously impressed if it did!) but it works perfectly in 4.6 :) With 4.7, it hangs and leaves a zombie process (not sure if that is a cause or a side-effect, but it might be worth noting). Running on RHEL9.3, Python 3.9.18.

rhpvorderman commented 6 months ago

@marcelm, this sounds like an xopen issue I can look into. @Redmar-van-den-Berg is a colleague of mine.

@Redmar-van-den-Berg have you tried reproducing this in a pip virtualenv, and then trying pip install xopen==1.9.0 vs pip install xopen==1.8.0? It sounds like an xopen bug.

I have to concur with @marcelm here that using the file paths directly will result in much better performance. xopen used to delegate to pigz, which was faster than Python's builtin gzip, but that was years ago. Now it delegates to ISA-L, which is twice as fast. A lot of work has been done to optimize gzip decompression.
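
As a rough illustration of what passing the .gz files directly buys you (a sketch based on my reading of the xopen README that marcelm linked, not Cutadapt's internals; the file name is a placeholder):

# Rough sketch: give xopen the .gz file name and let it pick its fastest
# available gzip backend (ISA-L's igzip if it is installed).
# "R1.fastq.gz" is a placeholder file name.
import xopen

with xopen.xopen("R1.fastq.gz", mode="rb", threads=1) as handle:
    for line in handle:
        pass  # downstream FASTQ parsing would happen here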

marcelm commented 6 months ago

Thanks both for the reports! @Redmar-van-den-Berg I hope just not using process substitution is a good enough fix for you.

@Jerrythafast As a workaround, you can use named FIFOs instead (or of course stay at Cutadapt 4.6 for the moment):

mkfifo R1.fastq.gz
mkfifo R2.fastq.gz
unzip -p /path/to/giant.zip internal/path/to/data_R1_001.fastq.gz > R1.fastq.gz &
unzip -p /path/to/giant.zip internal/path/to/data_R2_001.fastq.gz > R2.fastq.gz &
cutadapt --interleaved R1.fastq.gz R2.fastq.gz | next-command ...
rm R1.fastq.gz R2.fastq.gz

I can reproduce the problem even with older xopen versions using this command:

cutadapt --debug --cores=2 <(cat tests/data/nextseq.fastq) -o /dev/null

This happens only when running on multiple cores and arises because I changed the multiprocessing start method in 4.7.

In multicore mode, Cutadapt starts a separate process for reading the input FASTQ data. The reader process gets the name of the file to open. When process substitution is used, the name is something like /dev/fd/63. Since the reader is a different process that does not share open file descriptors with the main process, /dev/fd/63 does not exist and the reader process exits with an exception: No such file or directory: '/dev/fd/63'. (An additional problem is that the exception message is not printed and that the main process just hangs instead.)

Before Cutadapt 4.7, all input files would be opened in the main process. This worked because I used the fork method for starting processes, and when using fork, all open file descriptors are shared with the child process. However, fork is not the default on macOS, and spawn seems to be the safer choice even on Linux, so I switched the default to spawn, which means that open file descriptors are no longer shared automatically. When I made that switch, I looked into how to share file descriptors with a spawned process, but it got so complicated and system-specific that it didn’t feel like a good use of my time.
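
To make that concrete, here is a minimal self-contained sketch (not Cutadapt's actual code) of the failure mode: a /dev/fd/<n> path is resolved against the opening process's own descriptor table, and a spawned child does not inherit the parent's descriptors.

# Minimal sketch, not Cutadapt's code. With "spawn" the open in the child
# fails (No such file or directory on Linux); with "fork" the inherited
# descriptor makes it succeed.
import multiprocessing
import os

def reader(path):
    with open(path, "rb") as f:
        print(f.read())

if __name__ == "__main__":
    r, w = os.pipe()
    os.write(w, b"@read1\nACGT\n+\nIIII\n")
    os.close(w)
    os.dup2(r, 63)  # mimic the descriptor number the shell uses for <(...)
    ctx = multiprocessing.get_context("spawn")  # change to "fork" to compare
    p = ctx.Process(target=reader, args=("/dev/fd/63",))
    p.start()
    p.join()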

A temporary fix could be to switch back to using fork (which seems to work). However, fork isn’t safe to use when threads are involved, and the most recent xopen version has actually started to use threads, so in the long run we need a different fix.

Jerrythafast commented 6 months ago

Thanks for the info @marcelm! It's fine for us to stay at 4.6 for now. It makes sense to switch to spawn given the issues with fork and threading, and I can totally see how that messes things up when the file gets opened in a subprocess.

I tried setting --cores 1 to avoid spawning the subprocess in xopen (tracing the code, it looks like that should pass threads=0 to xopen, which is documented not to spawn a child process if it doesn't need to). However, cutadapt v4.7 still crashes on assert path is not None here, so possibly some work is also needed on the cutadapt side (I couldn't quickly identify the source of the None value just by reading the code).

Full command line parameters (from log file): --cores 1 --adapter TGGAATTCTCGGGTGCCAAGG --minimum-length 2 --interleaved /dev/fd/63 /dev/fd/62
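
For reference, this is roughly the threads=0 behaviour I was counting on (a tiny sketch based on my reading of the xopen docs; the pipe below just stands in for the shell's <(...) substitution):

# Sketch, assuming xopen's documented `threads` parameter: with threads=0
# the file is opened in the calling process itself, so a /dev/fd path that
# only this process can see still works.
import os
import xopen

r, w = os.pipe()
os.write(w, b"@read1\nACGT\n+\nIIII\n")
os.close(w)

with xopen.xopen(f"/dev/fd/{r}", mode="rb", threads=0) as handle:
    print(handle.read())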

marcelm commented 6 months ago

However, cutadapt v4.7 still crashes on assert path is not None

Thanks! This seems to be a separate problem with interleaved output. I filed a new issue about this.

Redmar-van-den-Berg commented 6 months ago

@marcelm Thanks for looking into this!

I understand that using process substitution is a bit of a weird use case, but I like the idea of trimming and merging in a single step. I've tried using named pipes, but found this comes with a 50% performance penalty when using a shared (cluster) filesystem.

Would you be interested in adding an option to cutadapt to accept multiple FastQ pairs, to be processed in order? This would sidestep the problem altogether.

As a workaround, I've found that using a small helper script to interleave the input and pipe it to cutadapt seems to come with no performance penalty:

~/scripts/interleaf.py --forward $f1 $f2 --reverse $r1 $r2 | cutadapt - --interleaved -o /dev/null -p /dev/null

#!/usr/bin/env python

import argparse
import dnaio
import xopen

def main(forward, reverse):

    for f, r in zip(forward, reverse):
        # Open the fastq files
        f_in = dnaio.open(f, opener=xopen.xopen)
        r_in = dnaio.open(r, opener=xopen.xopen)

        # Iterate over all reads
        for f_read, r_read in zip(f_in, r_in):
            print(
                f_read.fastq_bytes().decode(),
                r_read.fastq_bytes().decode(),
                end="",
                sep="",
            )

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--forward", nargs="+", required=True)
    parser.add_argument("--reverse", nargs="+", required=True)

    args = parser.parse_args()

    assert len(args.forward) == len(args.reverse)

    main(args.forward, args.reverse)

marcelm commented 6 months ago

I understand that using process substitution is a bit of a weird use case,

Nah, this is totally normal :smile:! I also use process substitution a lot (not necessarily with Cutadapt) and think this needs to be fixed.

but I like the idea of trimming and merging in a single step. [...] Would you be interested in adding an option to cutadapt to add multiple fastq pairs, to be processed in order?

Maybe, can you file this as a separate issue?

I've tried using named pipes, but found this comes with a 50% performance penalty when using a shared (cluster) filesystem.

This is strange since named pipes, to my understanding, work just like regular (unnamed) pipes, except that they have a name. Nothing should be written to or read from the file system. Named pipes can even reside on read-only file systems. The only network traffic should be the lookup of the inode. And you could even put them in /tmp to skip the network entirely (assuming that resides on node-local storage).

In your script, the print() decodes the bytes you get from fastq_bytes and then encodes them again (implicitly) when sending them to stdout. You can avoid this by writing sys.stdout.buffer.write(f_read.fastq_bytes()). But it’ll probably make only a small difference.
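
Roughly like this (a sketch reusing the same dnaio/xopen setup as your script; the file names are placeholders):

# Sketch of the suggested change: write the raw FASTQ bytes straight to
# stdout, skipping the decode/encode round trip of print().
# "R1.fastq.gz" and "R2.fastq.gz" are placeholder file names.
import sys
import dnaio
import xopen

with dnaio.open("R1.fastq.gz", opener=xopen.xopen) as f_in, \
     dnaio.open("R2.fastq.gz", opener=xopen.xopen) as r_in:
    for f_read, r_read in zip(f_in, r_in):
        sys.stdout.buffer.write(f_read.fastq_bytes())
        sys.stdout.buffer.write(r_read.fastq_bytes())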

Redmar-van-den-Berg commented 6 months ago

In your script, the print() decodes the bytes you get from fastq_bytes and then encodes them again (implicitly) when sending them to stdout. You can avoid this by writing sys.stdout.buffer.write(f_read.fastq_bytes()). But it’ll probably make only a small difference.

I've tried this, but you are right that it doesn't make much difference, probably because just interleaving the FastQ files and writing them to /dev/null is already twice as fast as running cutadapt.

marcelm commented 5 months ago

I reverted the change where I switched the default multiprocessing start method to spawn, so it’ll work again. However, in one of the next Python releases, spawn will become the default, and I’ll have to think once again about how to solve this properly. I have in any case added a test to catch this problem in the future.

I’ll release Cutadapt 4.8 with this fix.