marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

Compressed (.gz) Writers need to be manually closed #146

Open mgperry opened 3 days ago

mgperry commented 3 days ago

I've encountered a tricky bug where writing uncompressed fasta is fine, but writing compressed fasta produces an empty file in some cases. This was inside a demultiplexing script (my own, not using cutadapt) where I was writing multiple (~100) files simultaneously.

The fix was manually closing the files (i.e. calling writer.close() for each one), but I would expect close() to be called automatically when the Writer object goes out of scope (as it appears to be for uncompressed files).
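Roughly, the workaround looks like this (a sketch; the path and the records iterable are illustrative):

import dnaio

writer = dnaio.open("out.fa.gz", mode="w")
try:
    for record in records:  # records: any iterable of SequenceRecord objects
        writer.write(record)
finally:
    # without this explicit close(), the compressed output ends up empty or truncated
    writer.close()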

My best guess is that the process responsible for writing the compressed files is being silently dropped; however, I don't really speak Cython, so I haven't been able to look at the source. If the fix for this isn't obvious, I can try to put together a minimal example.

Thanks.

rhpvorderman commented 3 days ago

Can you provide a minimal reproducer and a pip list of your Python environment? There have been a few similar issues with the threaded gzip implementations that have been solved.

marcelm commented 1 day ago

I assume you also did not use xopen as a context manager. If so: you cannot rely on close() being called for you when the object is garbage collected (that is, when __del__ is called), because there’s no guarantee that the __del__ method is ever called.
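For a single file, the context-manager form guarantees the close. A minimal sketch (the records iterable is illustrative):

import dnaio

# the with-statement calls close() on exit, even if an exception is raised
with dnaio.open("reads.fa.gz", mode="w") as writer:
    for record in records:
        writer.write(record)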

I had problems in early Cutadapt versions when I was not so diligent about closing each output file, that is, when I was relying on close() being called automatically for me.

Since you are working with many files, either use your fix of manually closing each file with close(), or use contextlib.ExitStack, which takes care of most edge cases that can occur (for example, if opening one of the files fails and you get an exception, ExitStack ensures that the already-opened files are still closed).
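A minimal sketch of the ExitStack approach (the file names and the demultiplexing step are illustrative):

import contextlib
import dnaio

paths = [f"barcode_{i}.fa.gz" for i in range(100)]  # hypothetical output files

with contextlib.ExitStack() as stack:
    # enter_context() registers each writer with the stack; every writer is
    # closed when the with-block exits, even if one of the open() calls raises
    writers = [stack.enter_context(dnaio.open(p, mode="w")) for p in paths]
    for index, record in assign_to_barcodes(records):  # hypothetical demux function
        writers[index].write(record)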

mgperry commented 5 hours ago

@marcelm, thanks for the info (I did not know this), but I'm not sure it's relevant here. In this specific instance, I was calling the writers through a function, so (as I understand it) the destructors would be called when the function exits. I would also expect the writing to finish before the script exits (granted, buffering can be confusing, but surely the buffers should be flushed on program exit?).

mgperry commented 4 hours ago

@rhpvorderman, I've made an example. It looks like this behaviour only occurs when the writing happens inside a function. You can reproduce it with a single writer, which, when compressed, truncates its output. This doesn't happen with uncompressed files.

import dnaio
import random

# generate some data
DNA = ["A", "C", "G", "T"]

def random_dna(n):
    return dnaio.SequenceRecord("read", "".join(random.choices(DNA, k=n)))

reads = [random_dna(500) for _ in range(1000)]

# wrap writing inside a function
def write_demultiplexed(seqs, file):
    fa_file = dnaio.open(file, mode='w')

    for seq in seqs:
        fa_file.write(seq)

    # fix
    # fa_file.close()

write_demultiplexed(reads, "reads.fa.gz")
write_demultiplexed(reads, "reads.fa")

The results on my machine:

 $ zcat reads.fa.gz | grep "^>" | wc -l
gzip: reads.fa.gz: unexpected end of file
564

 $ cat reads.fa | grep "^>" | wc -l
1000

To my eyes, it looks like the compressed writing happens in a separate thread which is silently dropped when the function exits, rather than the function waiting for it to finish writing (which is what I would expect in Python).

Thanks for taking a look.

edit: pip list output, from a fresh Python 3.12.2 environment created with pyenv, with only dnaio installed

Package Version
------- -------
dnaio   1.2.2
isal    1.7.1
pip     24.1.2
xopen   2.0.2
zlib-ng 0.5.1

marcelm commented 4 hours ago

There’s no guarantee that __del__ is called, even within a function. It’s not like C/C++/Rust etc., where a destructor runs as soon as the variable goes out of scope. Something else may be keeping the object alive.

Many file-like objects implement a __del__() method that calls close(). This is a convenience that often works, but since there’s no guarantee that __del__ ever runs, it gives a false sense of security.
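To illustrate with a generic sketch (not dnaio-specific): a saved exception keeps the function’s frame, and therefore its locals, alive, so __del__ does not run when the function returns:

class Noisy:
    def __del__(self):
        print("__del__ ran")

saved_exc = None

def make_and_fail():
    obj = Noisy()  # one might expect __del__ to run when the function returns
    try:
        raise ValueError("boom")
    except ValueError as exc:
        global saved_exc
        # exc.__traceback__ references this frame, and the frame references
        # obj, so obj stays alive after the function has returned
        saved_exc = exc

make_and_fail()
print("function returned")  # "__del__ ran" has not been printed yet
saved_exc = None            # dropping the last reference; in CPython, __del__ runs here
print("reference dropped")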