mgperry opened this issue 3 weeks ago
Can you provide a minimal reproducer and a `pip list` of your Python environment? There have been a few similar issues with the threaded gzip implementations that have been solved.
I assume you also did not use `xopen` as a context manager. If so: you cannot rely on `close()` being called for you when the object is garbage collected (that is, when `__del__` is called) because there’s no guarantee that the `__del__` method is ever called.
I had problems in early Cutadapt versions when I was not so diligent about closing each output file, that is, when I was relying on `close()` being called automatically for me.
Since you are working with many files, either use your fix of manually closing each file using `close()`, or use `contextlib.ExitStack`, which takes care of most edge cases that can occur (for example, if opening one of the files fails and you get an exception, `ExitStack` ensures that the already opened files are still closed).
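A minimal sketch of the `ExitStack` approach for many output files; the `paths_by_sample` mapping and the sample/record pairing are illustrative, not taken from the original script:

```python
import contextlib
import dnaio

def demultiplex(records_with_samples, paths_by_sample):
    # ExitStack closes every writer on exit, even if one of the
    # dnaio.open() calls below raises partway through.
    with contextlib.ExitStack() as stack:
        writers = {
            sample: stack.enter_context(dnaio.open(path, mode="w"))
            for sample, path in paths_by_sample.items()
        }
        for sample, record in records_with_samples:
            writers[sample].write(record)
```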
@marcelm, thanks for the info (I did not know this); however, I'm not sure it's relevant here. In this specific instance, I was calling the writers through a function, so (as I understand it) the destructors would be called when the function exits. I would also expect the writing to finish before the script exits (granted, buffering can be confusing, but surely these should be flushed on program exit?).
@rhpvorderman, I've made an example. It looks like this behaviour only occurs when the writing happens inside a function. You can reproduce it with a single writer: the compressed output is truncated, while the uncompressed output is complete.
```python
import dnaio
import random

# generate some data
DNA = ["A", "C", "G", "T"]

def random_dna(n):
    return dnaio.SequenceRecord("read", "".join(random.choices(DNA, k=n)))

reads = [random_dna(500) for _ in range(1000)]

# wrap writing inside a function
def write_demultiplexed(seqs, file):
    fa_file = dnaio.open(file, mode='w')
    for seq in seqs:
        fa_file.write(seq)
    # fix
    # fa_file.close()

write_demultiplexed(reads, "reads.fa.gz")
write_demultiplexed(reads, "reads.fa")
```
The results on my machine:
```
$ zcat reads.fa.gz | grep "^>" | wc -l
gzip: reads.fa.gz: unexpected end of file
564

$ cat reads.fa | grep "^>" | wc -l
1000
```
To my eyes, it looks like the compressed writing happens in a separate thread which is silently dropped when the function exits, rather than the function waiting for it to finish writing (which is what I would expect in Python).
Thanks for taking a look.
edit: `pip list` output from a fresh 3.12.2 environment created with pyenv, installing only dnaio:

```
Package Version
------- -------
dnaio   1.2.2
isal    1.7.1
pip     24.1.2
xopen   2.0.2
zlib-ng 0.5.1
```
There’s no guarantee that `__del__` is called, even within a function. It’s not like in C/C++/Rust etc., where a destructor is run as soon as the variable goes out of scope. Something else may be keeping the object alive.
Many file-like objects implement a `__del__()` method that calls `close()`, which is a convenience that often works, but since there’s no guarantee that `__del__` ever runs, this gives a false sense of security.
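Applied to the reproducer above, the reliable fix is simply to open the writer as a context manager so `close()` runs deterministically when the block exits:

```python
import dnaio

def write_demultiplexed(seqs, file):
    # The with-block guarantees fa_file.close() is called, which flushes
    # and finalizes the gzip stream before the function returns.
    with dnaio.open(file, mode="w") as fa_file:
        for seq in seqs:
            fa_file.write(seq)
```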
I've tracked this down to behaviour in `xopen`, and how it interacts with the stdlib `gzip.GzipFile`[^1]. Basically, using the default options[^2], `xopen` ends up calling `_open_reproducible_gzip`, which creates an `IGzipFile` by first creating a file object and then passing it to the constructor. When the `IGzipFile` object goes out of scope, it doesn't close (or flush) the file object, and this isn't caught at the end of the program either.
I think this makes sense in a context where you pass in a file object, and Python seems to clean this up properly (i.e. close the file and flush the buffer), at least if the file object is in the main scope. In the context of using `dnaio.open` to write a new file, I think this is highly unexpected and goes against the general idea of Least Astonishment.
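A small standalone illustration of the underlying behaviour (plain `gzip`, not the actual xopen code): a `GzipFile` wrapping a separately opened file object only writes its remaining compressed data and the gzip trailer on `close()`, so dropping both objects without closing them can leave a truncated file.

```python
import gzip

raw = open("demo.txt.gz", "wb")
gz = gzip.GzipFile(fileobj=raw, mode="wb")
gz.write(b"ACGT" * 10000)
# If gz and raw simply go out of scope here without being closed, buffered
# compressed data and the gzip trailer may never reach the disk.
# Closing explicitly finalizes the stream (GzipFile.close() does not close
# a fileobj that was passed in, so both closes are needed):
gz.close()
raw.close()
```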
I haven't tried to implement this, but I think if you change `xopen._open_reproducible_gzip` to create the `GzipFile` using `filename` rather than `fileobj` by default (i.e. unless a `TextIOWrapper` or similar is passed), this would bypass the issue in most cases (and, most importantly, for the default case).
[^1]: From my tests, `GzipFile` and `IGzipFile` behave the same way.
[^2]: Weirdly, if you set `open_threads=1` (or anything greater than zero), this is completely bypassed by returning a different writer class and everything works fine.
This is done to simulate the behavior of `gzip --no-name`. If you use the `filename` parameter, the filename will be written to disk in the gzip header. This makes it harder to create reproducible output.
The `GzipFile` should work properly when you close it. You can't rely on the garbage collector to do this reliably.
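For illustration, the reproducibility point above can be achieved with plain `gzip` by passing an explicit empty filename and a fixed mtime; this is only a sketch of the idea, not xopen's actual `_open_reproducible_gzip` code, and the helper name is made up:

```python
import gzip

def open_gzip_no_name(path, compresslevel=6):
    # Hypothetical helper: wrap an explicitly opened file object and tell
    # GzipFile not to record a filename or timestamp in the gzip header,
    # mimicking `gzip --no-name` so the output is byte-for-byte reproducible.
    # The caller must still close both the returned GzipFile and raw.
    raw = open(path, "wb")
    return gzip.GzipFile(
        fileobj=raw,
        mode="wb",
        filename="",   # no FNAME field in the header
        mtime=0,       # fixed timestamp instead of the current time
        compresslevel=compresslevel,
    )
```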
> Weirdly, if you set `open_threads=1` (or anything greater than zero), this is completely bypassed by returning a different writer class and everything works fine.
That is because I built the threaded gzip writer, and I made it reproducible by default. Threaded gzip writing was deemed "too specialistic" for CPython, and as a result python-zlib-ng and python-isal have a little more freedom in defining the threaded conventions.
Thanks, that all makes sense. Nice work on the threaded implementations; this is off-topic, but is that going to be materially faster in most use cases?
An alternative quick fix would be to add a `__del__()` method (sorry @marcelm) to the `FileWriter` class that closes the object when it goes out of scope, which might not be 100% reliable but works for my example. This should respect the `_close_on_exit` parameter for people who need to work with streams.
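A rough sketch of what that safety net could look like (not dnaio's actual `FileWriter`; the attribute names here are assumptions based on the discussion):

```python
class FileWriter:
    def __init__(self, file, *, close_on_exit=True):
        self._file = file
        self._close_on_exit = close_on_exit

    def write(self, data):
        self._file.write(data)

    def close(self):
        if self._close_on_exit and not self._file.closed:
            self._file.close()

    def __del__(self):
        # Best-effort only: __del__ is not guaranteed to run, so this is a
        # safety net, not a replacement for calling close() explicitly.
        try:
            self.close()
        except Exception:
            pass
```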
It is better to properly close files rather than to rely on the CPython behavior of the garbage collector.
So adding a `__del__` method might fix your code now, but there is no guarantee this will work tomorrow. The only thing that is guaranteed is that everything will keep functioning correctly if you properly close the files.
See also this stackoverflow question and answer: https://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important
> this is off-topic but is that going to be materially faster in most use cases?
Yes. Compression is a heavy workload. Offloading it to another thread frees up the Python thread (Python is single-threaded, as you know) to do actual interesting work. Using more than one offloading thread only matters if a level higher than 1 or 2 is chosen. For any bioinformatics tool I recommend setting the compression level to 1.
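For example, using `xopen` directly (a sketch assuming the `threads` and `compresslevel` keyword arguments of recent xopen releases):

```python
from xopen import xopen

# Level-1 gzip compression offloaded to one worker thread: the Python
# thread stays free for parsing and record handling while compression
# runs in the background.
with xopen("reads.fastq.gz", mode="wb", compresslevel=1, threads=1) as f:
    f.write(b"@read\nACGT\n+\nIIII\n")
```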
> It is better to properly close files rather than to rely on the CPython behavior of the garbage collector.
I don't disagree; thanks for taking the time to help with this. My suggestion was just to help others avoid what I would consider surprising behaviour in the future. If you don't feel it's an issue (or that it's better solved with docs/examples), then that's fair, feel free to close.
> An alternative quick fix would be to add a `__del__()` method (sorry @marcelm) to the `FileWriter` class that closes the object when it goes out of scope, which might not be 100% reliable but works for my example. This should respect the `_close_on_exit` parameter for people who need to work with streams.
I’m not entirely opposed to this. I agree with Ruben that it’s risky because users should really use `.close()`, but it would lead to `close()` being called more often. The problem IMO is that there’s no error if `close()` is never called, so it’s very easy to just not do it and not even be aware that there is a problem.
I run my tests with `PYTHONDEVMODE=1` (or `-X dev` on the command line), which will print warnings when files have not been closed. On the minimal example from above:
```
issue146.py:28: ResourceWarning: unclosed file <_io.BufferedWriter name='reads.fa'>
  write_demultiplexed(reads, "reads.fa")
ResourceWarning: Enable tracemalloc to get the object allocation traceback
```
If we decide to implement `__del__`, we would need to make sure that the warning is still triggered.
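One way to keep the warning while still closing, modeled on how CPython's own io objects warn from `__del__` (a hedged sketch, not an actual dnaio patch; the class is simplified):

```python
import warnings

class FileWriter:
    def __init__(self, file):
        self._file = file

    def close(self):
        if not self._file.closed:
            self._file.close()

    def __del__(self):
        if self._file.closed:
            return
        # Emit the same kind of ResourceWarning that PYTHONDEVMODE surfaces
        # for unclosed io objects, then close as a best-effort fallback.
        warnings.warn(
            f"unclosed dnaio writer {self!r}",
            ResourceWarning,
            stacklevel=2,
            source=self,
        )
        self.close()
```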
An alternative is to force users to use a context manager: the `dnaio.open` call would return an object that does not itself support reading or writing; it would only provide an `__enter__` method that returns the actual object one can use. I’ve thought about this before on other occasions in other projects, but never actually implemented it that way because I’m worried that it is too far from what people are used to. (And we don’t really have a novelty budget in dnaio.)
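A hedged sketch of that idea (the class name is made up): the returned object exposes only the context manager protocol, so the real reader/writer can only be obtained inside a `with` block.

```python
import dnaio

class _PendingOpen:
    def __init__(self, path, mode="r"):
        self._path = path
        self._mode = mode
        self._inner = None

    def __enter__(self):
        # The actual dnaio reader/writer only exists inside the with-block.
        self._inner = dnaio.open(self._path, mode=self._mode)
        return self._inner

    def __exit__(self, exc_type, exc, tb):
        self._inner.close()
        return False
```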
I've encountered a tricky bug where writing uncompressed fasta is fine, but writing compressed fasta produces an empty file in some cases. This was inside a demultiplexing script (my own, not using cutadapt) where I was writing multiple (~100) files simultaneously.
The fix was manually closing the files (i.e. calling `writer.close()` for each one), but I would expect that this would be called automatically when the `Writer` object goes out of scope (as it appears to be for non-compressed files).
My best guess would be that the process responsible for writing the compressed files is getting silently dropped; however, I don't really speak Cython, so I haven't been able to look at the source. If the fix for this isn't obvious, I can try to generate a minimal example for this.
Thanks.