mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0
344 stars 7 forks source link

potential race condition #39

Closed koriege closed 2 months ago

koriege commented 2 months ago

Hi, I encountered a bug in v0.13.1 (python v3.6.15), where file integrity is broken when using xopen to read from rapidgzip stdout.

input file (test.fastq.gz)

@NB501486:703:HJLNGBGXV:4:23612:12350:20352 1:N:0:CCAACATT+AAGAGAGC
GCGGCGGCGGCAGCCGAGGCCCAAGCGCAGTCGGGGGCTGGGCGGGTGCGGGGGCTGCGGCCGTAGCTACCCCTGGGGACCAGGCAGAGGCAAATGTGGAGAGGGCCCCCGGCCATTTACCGGGGTGGGGGCCCCCGGCTCCTTGCATTG
+
AAAAAEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEAEEEEEEEEEEEAEE/EEEEEEE<EEE<AA<AAAAAAAEEAEAAEEEEEEEAAAA/AAEEA<<A</EA/E<A

reader.py (using xopen)

from xopen import xopen
with xopen("/dev/stdin") as f:
    data = f.read()
    print(data)

rapidgzip -kcd test.fastq.gz | python3 reader.py results in

6:703:HJLNGBGXV:4:23612:12350:20352 1:N:0:CCAACATT+AAGAGAGC
GCGGCGGCGGCAGCCGAGGCCCAAGCGCAGTCGGGGGCTGGGCGGGTGCGGGGGCTGCGGCCGTAGCTACCCCTGGGGACCAGGCAGAGGCAAATGTGGAGAGGGCCCCCGGCCATTTACCGGGGTGGGGGCCCCCGGCTCCTTGCATTG
+
AAAAAEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE<EEEEEEAEEEEEEEEEEEAEE/EEEEEEE<EEE<AA<AAAAAAAEEAEAAEEEEEEEAAAA/AAEEA<<A</EA/E<A

Since xopen is for example used by cutadapt, this tool terminates with an error

cutadapt --quiet <(rapidgzip -kcd test.fastq.gz)
cutadapt: error: Could not determine whether file <_io.BufferedReader name='/dev/fd/63'> is FASTA or FASTQ. The file extension was not available or not recognized and the first character in the file is unexpected.

A similar error occurs randomly when using jellyfish

jellyfish bc -m 23 -s 100000000 -o /dev/null <(rapidgzip -kcd test.fastq.gz)
terminate called after throwing an instance of 'std::runtime_error'
  what():  Unsupported format
Aborted (core dumped)

A workaround would be for now to use cat or a process group with a sleep

rapidgzip -kcd test.fastq.gz | cat | python3 reader.py
cutadapt --quiet <(rapidgzip -kcd test.fastq.gz | cat)

I did not observe this issue in v0.12.0, though having a similar issue when exporting the index from stdin

rapidgzip -kcd test.fastq.gz | gzip -kc | rapidgzip -f --export-index test.fastq.idx
Either stdin must have input, e.g., by piping to it, or an input file must be specified!
mxmlnkn commented 2 months ago

Urgh, this is bad. The 0.12.0 issue is different. That was simply a problem with detecting input on stdin.

I think the problem is with the vmsplice usage. I think that rapidgzip quits before all of the spliced output has been fully read, which leads to that memory being reused or repaged, or I don't know :/. I can reproduce the issue with your test file and can fix it by adding a sleep inside rapidgzip before it quits. And I can then reproduce the issue again if I add a larger sleep before the output is read like this: m rapidgzip && src/tools/rapidgzip -kcd issue-39.fastq.gz | ( sleep 2s; cat ) | python3 issue-39-reader.py.

If you are installing rapidgzip from source, then you can comment out the #define HAVE_VMSPLICE inside FileUtils.hpp to fix the issue. ~I think it is random chance that 0.12.0 works, but I guess I could try a bisection to be sure.~ I can also reproduce the issue with 0.12.0 when adding sleep before cat as shown above. I don't think there is a way to query whether the splice output has been read. I guess I'll have to disable vmsplice again because it just is impossibly hard to use :(. Or, I'll need to refactor it work with a custom allocator based on mmap, so that I can be sure that the allocated memory is not reused even on program end.

mxmlnkn commented 2 months ago

I have pushed a fix that simply disables vmsplice. You can try it out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'

If it works for you, I'll release it as soon as possible and probably yank the old 0.13.1 release just to be sure.

koriege commented 2 months ago

Thanks for the immediate fix! I can confirm, that it works for the pip installed as well for the self compiled version.

koriege commented 2 months ago

mh.. By curiosity I compiled without commenting out the vmsplice section and currently I am unable to reproduce the issue. Now I am unsure if this problem lies somewhere in the cython wrapper instead or if I mixed something up.

mxmlnkn commented 2 months ago

Did you try the call with the sleep? rapidgzip -kcd test.fastq.gz | ( sleep 1s; python3 reader.py ). With this, I can reproduce it very reliably. And, I'm also testing with the pure C++ version, so Cython shouldn't be an issue, but it might change the timing slightly.

Note that the fix will reduce performance until I get around to reimplement vmsplice cleanly with a custom mmap allocator, that should work then.

The windows wheels for rapidgzip 0.13.2 have already finished building and are uploaded already. The other wheels should have finished building in ~15 min.

koriege commented 2 months ago

Yes, I tried it that way, but might be a timing/caching thing as suggested. Looking forward to your re-implementation. Good luck with that!

tooptoop4 commented 2 months ago

@mxmlnkn is 0.13.0 affected by this bug?

mxmlnkn commented 2 months ago

@mxmlnkn is 0.13.0 affected by this bug?

Yes, and many even older versions. Note that this only affects Linux systems because only those have vmsplice.

tooptoop4 commented 2 months ago

@mxmlnkn is the bug only when another command is chained/piped after rapidgzip?

mxmlnkn commented 2 months ago

@mxmlnkn is the bug only when another command is chained/piped after rapidgzip?

Yes, that's correct.