Closed koriege closed 2 months ago
Urgh, this is bad. The 0.12.0 issue is different. That was simply a problem with detecting input on stdin.
I think the problem is with the vmsplice usage. I think that rapidgzip quits before all of the spliced output has been fully read, which leads to that memory being reused or repaged, or I don't know :/. I can reproduce the issue with your test file and can fix it by adding a sleep inside rapidgzip before it quits. And I can then reproduce the issue again if I add a larger sleep before the output is read like this: m rapidgzip && src/tools/rapidgzip -kcd issue-39.fastq.gz | ( sleep 2s; cat ) | python3 issue-39-reader.py
.
If you are installing rapidgzip from source, then you can comment out the #define HAVE_VMSPLICE
inside FileUtils.hpp
to fix the issue. ~I think it is random chance that 0.12.0 works, but I guess I could try a bisection to be sure.~ I can also reproduce the issue with 0.12.0 when adding sleep
before cat
as shown above. I don't think there is a way to query whether the splice output has been read. I guess I'll have to disable vmsplice again because it just is impossibly hard to use :(. Or, I'll need to refactor it work with a custom allocator based on mmap, so that I can be sure that the allocated memory is not reused even on program end.
I have pushed a fix that simply disables vmsplice. You can try it out with:
python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'
If it works for you, I'll release it as soon as possible and probably yank the old 0.13.1 release just to be sure.
Thanks for the immediate fix! I can confirm, that it works for the pip installed as well for the self compiled version.
mh.. By curiosity I compiled without commenting out the vmsplice section and currently I am unable to reproduce the issue. Now I am unsure if this problem lies somewhere in the cython wrapper instead or if I mixed something up.
Did you try the call with the sleep? rapidgzip -kcd test.fastq.gz | ( sleep 1s; python3 reader.py )
. With this, I can reproduce it very reliably. And, I'm also testing with the pure C++ version, so Cython shouldn't be an issue, but it might change the timing slightly.
Note that the fix will reduce performance until I get around to reimplement vmsplice cleanly with a custom mmap allocator, that should work then.
The windows wheels for rapidgzip 0.13.2 have already finished building and are uploaded already. The other wheels should have finished building in ~15 min.
Yes, I tried it that way, but might be a timing/caching thing as suggested. Looking forward to your re-implementation. Good luck with that!
@mxmlnkn is 0.13.0 affected by this bug?
@mxmlnkn is 0.13.0 affected by this bug?
Yes, and many even older versions. Note that this only affects Linux systems because only those have vmsplice.
@mxmlnkn is the bug only when another command is chained/piped after rapidgzip?
@mxmlnkn is the bug only when another command is chained/piped after rapidgzip?
Yes, that's correct.
Hi, I encountered a bug in v0.13.1 (python v3.6.15), where file integrity is broken when using xopen to read from rapidgzip stdout.
input file (test.fastq.gz)
reader.py (using xopen)
rapidgzip -kcd test.fastq.gz | python3 reader.py
results inSince xopen is for example used by
cutadapt
, this tool terminates with an errorA similar error occurs randomly when using
jellyfish
A workaround would be for now to use
cat
or a process group with asleep
I did not observe this issue in v0.12.0, though having a similar issue when exporting the index from stdin