Closed koriege closed 8 months ago
Which version are you using? Could you please downgrade to 0.11.1 and test that? I'm getting 0 0 0 0 for that version. I know the problem with the newer versions. I'm having some trouble correctly initializing rpmalloc in the myriad of use cases. Maybe I'll simply disable it until rpmalloc 2.0, which is easier to initialize, is released.
Oh sorry, my fault. I was still on 0.10.4. v0.11.0 and v0.11.1 are working as expected (exit code 0).
Ah ok. The error on 0.10.4 is ValueError: This class heavily relies on seeking and won't work with unseekable files! Support for non-seekable files is a new feature in 0.11.0.
Sorry again for being imprecise; I tested multiple commands. Being on v0.11.1, I get the following trace with an exit code of 0:
seq 1 100000 | gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1
1
error: 32
Traceback (most recent call last):
File "/home/lakatos/kriege/programs/conda/envs/test/bin/rapidgzip", line 8, in <module>
sys.exit(cli())
^^^^^
File "rapidgzip.pyx", line 665, in rapidgzip.cli
RuntimeError: Failed to write to pipe
Ok, it seems to work just fine. You could pipe stderr to /dev/null to avoid the traceback output but I guess you don't want to do that because it is not necessary for gzip? I probably can catch that exception and suppress the traceback and it should work just fine.
This would be great. I am also wondering why this trace does not occur for smaller files (seq to 10k instead of 100k):
seq 1 10000 | gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1
I probably can catch that exception and suppress the traceback and it should work just fine.
There are some conceptual problems, though, when it comes to combined command-line actions. For example, you can do: src/tools/rapidgzip -kcd --export-index index --count foo.gz > /dev/null | head
and it probably should still generate the full index and maybe even return the correct count on stderr. However, without any of these additional options, we would want to stop decoding as soon as the output stream end has been encountered. Therefore, it seems like a simple try-catch will not work, and instead, some kind of flag is necessary.
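The flag idea could look roughly like the following. This is a hypothetical sketch, not rapidgzip's actual code; decode_all, chunks, and OutputClosed are made-up names used only to illustrate the control flow:

```python
import sys

class OutputClosed(Exception):
    """Hypothetical marker: the consumer of stdout has gone away."""

def decode_all(chunks, write, export_index=None, count=False):
    # Instead of aborting in a bare try/except, remember that the
    # output is gone and keep decoding only if side effects
    # (--count, --export-index) still need the remaining data.
    output_alive = True
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if output_alive:
            try:
                write(chunk)
            except BrokenPipeError:
                output_alive = False
                if export_index is None and not count:
                    # No side effects requested: stop decoding early.
                    raise OutputClosed()
    if count:
        print(total, file=sys.stderr)
    if export_index is not None:
        export_index()
    return total
```

With --count or --export-index set, decoding continues to the end despite the broken pipe; without them, the exception stops the work as soon as the output stream end is encountered.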
I am also wondering, why this trace does not occur for smaller files (seq to 10k instead of 100k)
seq 1 10000 | gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1
Interesting question. This is because I'm using vmsplice, which is kind of asynchronous. The first call will work just fine, while the second or third call, depending on how much time has passed, will return the broken pipe error. On my system, seq 1 14117 works but seq 1 14118 yields the exception.
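The timing dependence can be reproduced without rapidgzip: a write into a pipe succeeds as long as it fits into the kernel's pipe buffer, even if nothing ever reads it, and only a write issued after the reader has closed its end fails with EPIPE (errno 32, matching the error: 32 above). A minimal Python illustration:

```python
import errno
import os

r, w = os.pipe()

# While the read end is open, a small write succeeds immediately:
# it just lands in the kernel's pipe buffer (64 KiB by default on
# Linux), regardless of whether anything ever reads it.
assert os.write(w, b"x" * 1024) == 1024

# Now the reader goes away, like `head` exiting after one line.
os.close(r)

# Only the *next* write notices: CPython ignores SIGPIPE, so the
# failed write surfaces as BrokenPipeError with errno EPIPE (32).
try:
    os.write(w, b"x")
    raise AssertionError("expected BrokenPipeError")
except BrokenPipeError as e:
    assert e.errno == errno.EPIPE  # 32
os.close(w)
```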
> seq 1 14117 | gzip -kc > foo.gz; python3 -u -c 'import rapidgzip; rapidgzip.cli()' -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73596
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, nBytesWritten: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8188)
[writeAllSpliceUnsafe] nBytesWritten:8188
0 0
> seq 1 14118 | gzip -kc > foo.gz; python3 -u -c 'import rapidgzip; rapidgzip.cli()' -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73602
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, nBytesWritten: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8194)
[writeAllSpliceUnsafe] nBytesWritten:8192
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:2)
[writeAllSpliceUnsafe] nBytesWritten:-1
error: 32
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "rapidgzip.pyx", line 707, in rapidgzip.cli
RuntimeError: Failed to write to pipe
1 0
It looks a bit different when calling rapidgzip built with CMake (without the Python layer):
> seq 1 14118 | gzip -kc > foo.gz; src/tools/rapidgzip -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73602
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, remaining bytes to write: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8194)
[writeAllSpliceUnsafe] nBytesWritten:8192
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:2)
141 0
For example, you can do:
src/tools/rapidgzip -kcd --export-index index --count foo.gz > /dev/null | head
and it probably should still generate the full index and maybe even return the correct count on stderr.
I don't get your point here. This command will always generate the full index and terminate without receiving any signal from head, because head will never get any stdin. On the other hand, when the user is not redirecting to /dev/null, and head is going to kill the rapidgzip process, for sure the index will also be, and should be, corrupted as a result of what the user did.
The first call will work just fine, while the second or third call, depending on how much time has passed, will return the broken pipe error.
I see, but can't you simply exit the Python main upon catching BrokenPipeError and KeyboardInterrupt (SIGINT, or ctrl+c)?
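The suggested pattern could be sketched roughly as follows (cli here is a stand-in for any CLI entry point, not rapidgzip's actual one; the /dev/null trick follows the note in the Python signal documentation about broken pipes at interpreter shutdown):

```python
import os
import signal
import sys

def run(cli):
    """Wrap a CLI entry point so broken pipes and Ctrl+C exit
    quietly with the conventional 128 + signal exit codes."""
    try:
        cli()
    except BrokenPipeError:
        # Point stdout at /dev/null so the interpreter's final
        # flush at shutdown cannot raise a second BrokenPipeError.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(128 + signal.SIGPIPE)   # 141, like gzip and pigz
    except KeyboardInterrupt:
        sys.exit(128 + signal.SIGINT)    # 130
```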
I don't get your point here. This command will always generate the full index and terminates without receiving any signal by head, because head will never get any stdin. On the other hand, when the user is not redirecting to /dev/null, and head is going to kill the rapidgzip process, for sure also the index will be and should be corrupted as a result of what the user did.
My bad. Not sure where the >/dev/null came from. I meant simply to combine -c and --export-index, which is allowed. It will write the decompressed result to stdout and export the index after it has finished decompressing.
I have pushed a commit. You can try it out with:
python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egg=rapidgzip&subdirectory=python/rapidgzip'
Some further notes:
I am not that well-versed in POSIX, but the exit code 141 is very likely shell-dependent because the program (gzip, or even rapidgzip without the Python wrapper) exited via a signal without a valid return code. In that case, bash seems to set the return code to 128 + signal code.
Python seems to overwrite the signal handler for SIGPIPE and effectively ignores it. Without the Python wrapper, the call to vmsplice on a broken pipe does not even return to the line of code thereafter. When the signal is ignored, it returns with errno set to EPIPE.
The current solution propagates the error and effectively returns with a similar exit code. Alternatively, I could have overwritten the signal handler, but that feels wrong because who knows how much Python depends on its overwritten signal handler. I'm not sure whether this difference can be observed from outside or whether the return code is the only observable. It seems like the wait syscall has this information, but bash does not expose it and instead only constructs a special return code as stated above.
Some further interesting considerations for SIGPIPE.
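Both notes above can be checked from Python directly: CPython installs SIG_IGN for SIGPIPE at interpreter startup, and a child process that dies from an unhandled SIGPIPE is reported by the wait status as "killed by signal 13", which bash turns into 128 + 13 = 141. A small demonstration, assuming a POSIX system:

```python
import signal
import subprocess
import sys

# CPython ignores SIGPIPE at startup, which is why a broken pipe
# shows up as a BrokenPipeError from write() instead of killing
# the process outright.
assert signal.getsignal(signal.SIGPIPE) == signal.SIG_IGN

# A child that restores the default handler and then receives
# SIGPIPE dies from the signal. subprocess encodes "killed by
# signal N" as a negative return code...
child = subprocess.run([sys.executable, "-c",
    "import os, signal;"
    "signal.signal(signal.SIGPIPE, signal.SIG_DFL);"
    "os.kill(os.getpid(), signal.SIGPIPE)"])
assert child.returncode == -signal.SIGPIPE  # -13

# ...while bash reports the same death as 128 + 13 = 141.
assert 128 + signal.SIGPIPE == 141
```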
The argument about embracing lazy evaluation made me discard my argument with --export-index, i.e., an index will not be exported on a broken pipe. Currently, this is by design.
Personally, I have no strong opinion about this yet. I see that the backtrace is confusing and implies a more problematic error than is actually the case. Therefore, I'll only guarantee that the backtrace won't be printed. As for the exact error code or the exact exit method used, I'll leave this open to change in the future (this is not even a 1.0 version yet).
Works like a charm. Thank you.
the exit code 141 is very likely shell-dependent
Actually not. As far as I know, the kernel sends SIGPIPE to the process which continues writing to a buffer for which no readers exist anymore. My guess is that vmsplice receives the SIGPIPE and terminates by throwing an EPIPE error, which bubbles up in the Python code and has to be caught there accordingly. Though, I am not a Python developer...
Hi,
I am using rapidgzip as a drop-in replacement for gunzip and pigz, respectively. Whereas the latter tools exit with 141 upon SIGPIPE without further messages, rapidgzip's exit code is 1 and it prints an error stack trace. May I suggest silencing rapidgzip and exiting with 141 as well?
Example:
seq 1 10000 | gzip -c | gzip -d | head; echo ${PIPESTATUS[*]} # 0 0 141 0
vs.
seq 1 10000 | gzip -c | rapidgzip -d | head; echo ${PIPESTATUS[*]} # 0 0 1 0