mgperry opened 3 days ago
Can you provide a minimal reproducer and a pip list of your python environment? There have been a few similar issues with the threaded gzip implementations that have been solved.
I assume you also did not use xopen as a context manager. If so: you cannot rely on close() being called for you when the object is garbage collected (that is, when __del__ is called), because there's no guarantee that the __del__ method is ever called.
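For reference, a minimal sketch of the context-manager pattern being recommended here, shown with dnaio.open (which supports the with statement just like xopen); the function name is made up for illustration:

import dnaio

def write_records(records, path):
    # the writer is closed, and its buffers flushed, as soon as the
    # with block exits, even if an exception is raised while writing
    with dnaio.open(path, mode="w") as writer:
        for record in records:
            writer.write(record)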
I had problems in early Cutadapt versions when I was not so diligent about closing each output file, that is, when I was relying on close() being called automatically for me.
Since you are working with many files, either use your fix of manually closing each file using close(), or use contextlib.ExitStack, which takes care of most edge cases that can occur (for example, if opening one of the files fails and you get an exception, ExitStack ensures that the already opened files are still closed).
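A sketch of what the ExitStack approach could look like for the many-files case (the barcode list and file naming below are invented for illustration):

import contextlib
import dnaio

barcodes = ["ACGT", "TGCA", "GGCC"]  # hypothetical demultiplexing barcodes

with contextlib.ExitStack() as stack:
    # every writer registered on the stack is closed when the with block
    # exits, even if opening one of the later files raises an exception
    writers = {
        bc: stack.enter_context(dnaio.open(f"demux_{bc}.fa.gz", mode="w"))
        for bc in barcodes
    }
    # ... write each record to writers[its_barcode] here ...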
@marcelm, thanks for the info (I did not know this), but I'm not sure it's relevant here. In this specific instance, I was calling the writers through a function, so (as I understand it) the destructors would be called when the function exits. I would also expect the writing to finish before the script exits (granted, buffering can be confusing, but surely these buffers should be flushed on program exit?).
@rhpvorderman, I've made an example. It looks like this behaviour only occurs when the writing happens inside a function. You can reproduce it with a single writer, which truncates its output when the file is compressed. This doesn't occur with uncompressed files.
import dnaio
import random

# generate some data
DNA = ["A", "C", "G", "T"]

def random_dna(n):
    return dnaio.SequenceRecord("read", "".join(random.choices(DNA, k=n)))

reads = [random_dna(500) for _ in range(1000)]

# wrap writing inside a function
def write_demultiplexed(seqs, file):
    fa_file = dnaio.open(file, mode='w')
    for seq in seqs:
        fa_file.write(seq)
    # fix
    # fa_file.close()

write_demultiplexed(reads, "reads.fa.gz")
write_demultiplexed(reads, "reads.fa")
The results on my machine:
$ zcat reads.fa.gz | grep "^>" | wc -l
gzip: reads.fa.gz: unexpected end of file
564
$ cat reads.fa | grep "^>" | wc -l
1000
To my eyes, it looks like the compressed writing happens in a separate thread which is silently dropped when the function exits, rather than the function waiting for it to finish writing (which is what I would expect in Python).
Thanks for taking a look.
edit: pip list output, from a fresh Python 3.12.2 environment created with pyenv, installing only dnaio:
Package Version
------- -------
dnaio 1.2.2
isal 1.7.1
pip 24.1.2
xopen 2.0.2
zlib-ng 0.5.1
There’s no guarantee that __del__ is called, even within a function. It’s not like in C/C++/Rust etc., where a destructor is run as soon as the variable goes out of scope. Something else may be keeping the object alive. Many file-like objects implement a __del__() method that calls close(), which is a convenience that often works, but since there’s no guarantee that __del__ ever runs, this gives a false sense of security.
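A contrived sketch (not dnaio-specific) of how a surviving reference defeats the "it gets closed when the function exits" expectation; the FakeWriter class and list below are made up for illustration:

class FakeWriter:
    def __del__(self):
        print("__del__ called")

kept_alive = []  # e.g. a cache, a log record, or a stored exception traceback

def write_things():
    w = FakeWriter()
    kept_alive.append(w)
    # w goes out of scope here, but the object survives, so __del__
    # (and any close() it would trigger) does not run

write_things()
print("function returned, object is still alive")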
I've encountered a tricky bug where writing uncompressed fasta is fine, but writing compressed fasta produces an empty file in some cases. This was inside a demultiplexing script (my own, not using cutadapt) where I was writing multiple (~100) files simultaneously.
The fix was manually closing the files (i.e. calling writer.close() for each one), but I would expect this to happen automatically when the Writer object goes out of scope (as it appears to for non-compressed files). My best guess is that the process responsible for writing the compressed files is getting silently dropped; however, I don't really speak Cython, so I haven't been able to look at the source. If the fix for this isn't obvious, I can try to generate a minimal example.
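For context, a sketch of the manual-close fix described above (the sample names and file layout are invented; the point is only that every writer gets an explicit close()):

import dnaio

# one writer per output file; names are hypothetical
writers = {
    sample: dnaio.open(f"{sample}.fa.gz", mode="w")
    for sample in ("sample_A", "sample_B", "sample_C")
}
try:
    # ... route each record to writers[its_sample].write(record) ...
    pass
finally:
    for writer in writers.values():
        writer.close()  # flushes buffered/compressed data to disk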
Thanks.