marcelm / dnaio

Efficiently read and write sequencing data from Python
https://dnaio.readthedocs.io/
MIT License

Compressed (.gz) Writers need to be manually closed #146

Open mgperry opened 3 weeks ago

mgperry commented 3 weeks ago

I've encountered a tricky bug where writing uncompressed fasta is fine, but writing compressed fasta produces an empty file in some cases. This was inside a demultiplexing script (my own, not using cutadapt) where I was writing multiple (~100) files simultaneously.

The fix was manually closing the files (i.e. calling writer.close() for each one), but I would expect that this would be called automatically when the Writer object goes out of scope (as it appears to be for non-compressed files).

My best guess would be that the process responsible for writing the compressed files is getting silently dropped; however, I don't really speak Cython, so I haven't been able to look at the source. If the fix isn't obvious, I can try to put together a minimal example.

Thanks.

rhpvorderman commented 3 weeks ago

Can you provide a minimal reproducer and a pip list of your python environment? There have been a few similar issues with the threaded gzip implementations that have been solved.

marcelm commented 3 weeks ago

I assume you also did not use xopen as a context manager. If so: You cannot rely on close() being called for you when the object is garbage collected (that is, when __del__ is called) because there’s no guarantee that the __del__ method is ever called.

I had problems in early Cutadapt versions when I was not so diligent about closing each output file, that is, when I was relying on close() being called automatically for me.

Since you are working with many files, either use your fix of manually closing each file using close() or use contextlib.ExitStack, which takes care of most edge cases that can occur (for example, if opening one of the files fails and you get an exception, ExitStack ensures that the already opened files are still closed).
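
For concreteness, a minimal sketch of the ExitStack approach (the barcode names and file names below are made up for illustration):

import contextlib

import dnaio

# Open one writer per barcode; ExitStack guarantees that every writer
# that was successfully opened gets closed, even if a later open() or
# write() raises.
barcodes = ["AAAA", "CCCC", "GGGG"]  # placeholder barcodes

with contextlib.ExitStack() as stack:
    writers = {
        bc: stack.enter_context(dnaio.open(f"demux_{bc}.fasta.gz", mode="w"))
        for bc in barcodes
    }
    # ... for each input record, look up writers[barcode] and call .write(record) ...
# All writers are closed here.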

mgperry commented 3 weeks ago

@marcelm, thanks for the info (I did not know this); however, I'm not sure it's relevant here. In this specific instance, I was calling the writers through a function, so (as I understand it) the destructors would be called when the function exits. I would also expect the writing to finish before the script exits (granted, buffering can be confusing, but surely these buffers should be flushed on program exit?).

mgperry commented 3 weeks ago

@rhpvorderman, I've made an example. It looks like this behaviour only occurs when the writing is behind a function. You can reproduce this using a single writer, which, when compressed, truncates its output. This doesn't occur with uncompressed files.

import dnaio
import random

# generate some data
DNA = ["A", "C", "G", "T"]

def random_dna(n):
    return dnaio.SequenceRecord("read", "".join(random.choices(DNA, k=n)))

reads = [random_dna(500) for _ in range(1000)]

# wrap writing inside a function
def write_demultiplexed(seqs, file):
    fa_file = dnaio.open(file, mode='w')

    for seq in seqs:
        fa_file.write(seq)

    # fix
    # fa_file.close()

write_demultiplexed(reads, "reads.fa.gz")
write_demultiplexed(reads, "reads.fa")

The results on my machine:

 $ zcat reads.fa.gz | grep "^>" | wc -l
gzip: reads.fa.gz: unexpected end of file
564

 $ cat reads.fa | grep "^>" | wc -l
1000

To my eyes, it looks like the compressed writing happens in a separate thread which is silently dropped when the function exits, rather than the function waiting for it to finish writing (which is what I would expect in Python).

Thanks for taking a look.

edit: pip list output from a fresh Python 3.12.2 environment (pyenv) with only dnaio installed

Package Version
------- -------
dnaio   1.2.2
isal    1.7.1
pip     24.1.2
xopen   2.0.2
zlib-ng 0.5.1

marcelm commented 3 weeks ago

There’s no guarantee that __del__ is called, even within a function. It’s not like in C/C++/Rust etc. where a destructor is run as soon as the variable gets out of scope. Something else may be keeping the object alive.

Many file-like objects implement a __del__() method that calls close(), which is a convenience that often works, but since there’s no guarantee that __del__ ever runs, this gives a false sense of security.

mgperry commented 3 weeks ago

I've tracked this down to behaviour in xopen, and how it interacts with the stdlib gzip.GzipFile[^1]. Basically, with the default options[^2], xopen ends up calling _open_reproducible_gzip, which creates an IGzipFile by first opening a file object and then passing it to the constructor. When the IGzipFile object goes out of scope, it doesn't close (or flush) the file object, and this isn't caught at the end of the program either.

I think this makes sense in a context where you pass in a file object yourself, and Python seems to clean this up properly (i.e. close the file and flush the buffer), at least if the file object is in the main scope. In the context of using dnaio.open to write a new file, I think this is highly unexpected and goes against the principle of least astonishment.

I haven't tried to implement this, but I think that if you change xopen._open_reproducible_gzip to create a GzipFile using filename rather than fileobj by default (i.e. unless the caller passes a TextIOWrapper or similar), this would bypass the issue in most cases (and, most importantly, in the default case).

[^1]: From my tests, GzipFile and IGzipFile behave the same way.

[^2]: Weirdly, if you set open_threads=1 (or anything greater than zero), this is completely bypassed by returning a different writer class and everything works fine.
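
For anyone else hitting this, a stripped-down standard-library illustration of the failure mode described above (this is my own sketch, not xopen's actual code; the real _open_reproducible_gzip also takes care of header reproducibility):

import gzip

# A GzipFile wrapping a separately opened file object, roughly the
# pattern described above (simplified).
raw = open("out.txt.gz", "wb")
gz = gzip.GzipFile(fileobj=raw, mode="wb")
gz.write(b"ACGT\n" * 1000)

# If gz.close() is never called, the data still buffered inside GzipFile
# and the gzip trailer (CRC32 + uncompressed size) never reach `raw`, so
# the file on disk is truncated -- matching the "unexpected end of file"
# from zcat above.
gz.close()
raw.close()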

rhpvorderman commented 3 weeks ago

This is done to simulate the behavior of gzip --no-name. If you use the filename parameter, the filename is written into the gzip header, which makes it harder to create reproducible output.

The GzipFile should work properly when you close it. You can't rely on the garbage collector to do this reliably.
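
To make the reproducibility point concrete, a small standard-library sketch (my own illustration, not xopen code): the gzip header contains a timestamp, and a filename field when one is supplied, so both have to be suppressed to get byte-identical output.

import gzip
import io

def gz_bytes(data, mtime):
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()

# Same input, different header timestamp -> different bytes.
assert gz_bytes(b"ACGT" * 100, mtime=0) != gz_bytes(b"ACGT" * 100, mtime=1_000_000)

# Pinning the timestamp (and not embedding a filename), as gzip --no-name
# does, makes repeated runs produce identical files.
assert gz_bytes(b"ACGT" * 100, mtime=0) == gz_bytes(b"ACGT" * 100, mtime=0)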

rhpvorderman commented 3 weeks ago

Weirdly, if you set open_threads=1 (or anything greater than zero), this is completely bypassed by return a different writer class and everything works fine.

That is because I built the threaded gzip writer and made it reproducible by default. Threaded gzip writing was deemed "too specialistic" for CPython, and as a result python-zlib-ng and python-isal have a little more freedom in defining the threading conventions.

mgperry commented 3 weeks ago

Thanks, that all makes sense. Nice work on the threaded implementations. This is off-topic, but is that going to be materially faster in most use cases?

An alternative quick fix would be to add a __del__() method (sorry @marcelm) to the FileWriter class that closes the object when it goes out of scope, which might not be 100% reliable but works in my example. This should respect the _close_on_exit parameter for people who need to work with streams.

rhpvorderman commented 3 weeks ago

It is better to properly close files rather than to rely on the CPython behavior of the garbage collector.

So adding a __del__ method might fix your code now, but there is no guarantee this will work tomorrow. The only thing that is guaranteed is that everything will keep functioning correctly if you properly close the files.

See also this stackoverflow question and answer: https://stackoverflow.com/questions/7395542/is-explicitly-closing-files-important

rhpvorderman commented 3 weeks ago

this is off-topic but is that going to be materially faster in most use cases?

Yes. Compression is a heavy workload. Offloading it to another thread frees up the python thread (python is single threaded, as you know) to do actual interesting work. Using more than one offloading thread only matters if a level higher than 1 or 2 is chosen. For any bioinformatics tool I recommend setting the compression level at 1.
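
Putting the two recommendations from this thread together, a hedged sketch of how the reproducer above could be written (open_threads is the parameter mentioned earlier in this thread; whether additional compression settings can be passed through dnaio.open is not something I've checked, so they are left out):

import dnaio

# Use a context manager so close() is guaranteed, and offload gzip
# compression to a worker thread via open_threads.
# `reads` is the list of SequenceRecords from the reproducer above.
with dnaio.open("reads.fa.gz", mode="w", open_threads=1) as writer:
    for record in reads:
        writer.write(record)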

mgperry commented 3 weeks ago

It is better to properly close files rather than to rely on the CPython behavior of the garbage collector.

I don't disagree, thanks for taking the time to help with this. My suggestion was just to help others avoid what I would consider surprising behaviour in the future, if you don't feel it's an issue (or it's better solved with docs/examples) then that's fair, feel free to close.

marcelm commented 3 weeks ago

An alternative quick fix would be to add a __del__() method (sorry @marcelm) to the FileWriter class that closes the object when it goes out of scope, which might not be 100% reliable but works on my example. This should respect the _close_on_exit parameter for people who need to work with streams.

I’m not entirely opposed to this. I agree with Ruben that it’s risky because users should really use .close(), but it would lead to close() being called more often. The problem IMO is that there’s no error if close() is never called, so it’s very easy to just not do it and not even be aware that there is a problem.

I run my tests with PYTHONDEVMODE=1 (or -X dev on the command line), which prints warnings when files have not been closed. On the minimal example from above:

issue146.py:28: ResourceWarning: unclosed file <_io.BufferedWriter name='reads.fa'>
  write_demultiplexed(reads, "reads.fa")
ResourceWarning: Enable tracemalloc to get the object allocation traceback

If we decide to implement __del__, we would need to make sure that the warning is still triggered.
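
For illustration, one way such a __del__ could keep the warning visible, sketched on a made-up writer class (this is not dnaio's FileWriter; attribute names are hypothetical):

import warnings

class _SketchWriter:
    def __init__(self, path):
        self._file = open(path, "wb")  # hypothetical underlying file

    def close(self):
        if not self._file.closed:
            self._file.close()

    def __del__(self):
        if not self._file.closed:
            # Emit the same kind of ResourceWarning that PYTHONDEVMODE
            # surfaces, then close as a last resort.
            warnings.warn(f"unclosed writer {self!r}", ResourceWarning, source=self)
            self.close()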

An alternative is to force users to use a context manager: the dnaio.open call would return an object that does not support reading or writing; it would only provide an __enter__ method that returns the actual object one can use. I’ve thought about this on other occasions in other projects, but never actually implemented it that way because I’m worried that it is too far from what people are used to. (And we don’t really have a novelty budget in dnaio.)
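
A rough sketch of that idea (nothing like this exists in dnaio; _real_open stands in for the actual opening logic):

_real_open = open  # placeholder for the real file-opening logic

class _DeferredOpen:
    """Only hands out a usable file-like object via a with-statement."""

    def __init__(self, *args, **kwargs):
        self._args = args
        self._kwargs = kwargs

    def __enter__(self):
        self._inner = _real_open(*self._args, **self._kwargs)
        return self._inner

    def __exit__(self, exc_type, exc, tb):
        self._inner.close()

# Forgetting the with-statement leaves you with an object that has no
# read() or write(), so the mistake is caught immediately.
with _DeferredOpen("reads.fa", "w") as f:
    f.write(">read\nACGT\n")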