mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0
364 stars 7 forks source link

suggestion: add handler for SIGPIPE #32

Closed koriege closed 8 months ago

koriege commented 9 months ago

Hi,

I am using rapidgzip as a drop-in replacement for gunzip, pigz respectively. Whereas the latter tools exit with 141 upon SIGPIPE without further messages, rapidgzip exit code is 1 and prints an error stack trace. May I suggest to silence rapidgzip and exit with 141 as well?

Example: seq 1 10000 | gzip -c | gzip -d | head; echo ${PIPESTATUS[*]} # 0 0 141 0 vs seq 1 10000 | gzip -c | rapidgzip -d | head; echo ${PIPESTATUS[*]} # 0 0 1 0

mxmlnkn commented 9 months ago

Which version are you using? Could you please downgrade to 0.11.1 and test that? I'm getting 0 0 0 0 for that version. I know the problem with the newer versions. I'm having some trouble correctly initializing rpmalloc in the myriad of use cases. Maybe I'll simply disable it until rpmalloc 2.0, which is easier to initialize, is released.

koriege commented 9 months ago

Oh sorry, my fault. I was still on 10.4 v0.11.0 and v0.11.1 are working as ~ expected (exit code 0)

mxmlnkn commented 9 months ago

Ah ok. The error on 0.10.4 is ValueError: This class heavily relies on seeking and won't work with unseekable files!. Support for non-seekable files is a new feature in 0.11.0.

koriege commented 9 months ago

sorry again for beeing unprecise, I tested multiple commands. beeing on v011.1, I get the following trace with an exit code 0

seq 1 100000| gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1
1
error: 32
Traceback (most recent call last):
  File "/home/lakatos/kriege/programs/conda/envs/test/bin/rapidgzip", line 8, in <module>
    sys.exit(cli())
             ^^^^^
  File "rapidgzip.pyx", line 665, in rapidgzip.cli
RuntimeError: Failed to write to pipe
mxmlnkn commented 9 months ago

Ok, it seems to work just fine. You could pipe stderr to /dev/null to avoid the traceback output but I guess you don't want to do that because it is not necessary for gzip? I probably can catch that exception and suppress the traceback and it should work just fine.

koriege commented 9 months ago

this would be great. I am also wondering, why this trace does not occur for smaller files (seq to 10k instead of 100k) seq 1 10000 | gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1

mxmlnkn commented 9 months ago

I probably can catch that exception and suppress the traceback and it should work just fine.

There are some conceptual problems, though, when it comes to combined command-line actions. For example, you can do: src/tools/rapidgzip -kcd --export-index index --count foo.gz> /dev/null | head and it probably should still generate the full index and maybe even return the correct count on stderr. However, without any of these additional options, we would want to stop decoding as soon as the output stream end has been encountered. Therefore, it seems like a simple try-catch will not work, and instead, some kind of flag is necessary.

I am also wondering, why this trace does not occur for smaller files (seq to 10k instead of 100k) seq 1 10000 | gzip -kc > foo.gz; rapidgzip -kcd foo.gz | head -1

Interesting question. This is because I'm using vmsplice, which is kind of asynchronous. The first call will work just fine, while the second or third call, depending on how much time has passed, will return the broken pipe error. On my system seq 1 14117 works but seq 1 14118 yields the exception.

> seq 1 14117 | gzip -kc > foo.gz; python3 -u -c 'import rapidgzip; rapidgzip.cli()' -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73596
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408)
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, nBytesWritten: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8188)
[writeAllSpliceUnsafe] nBytesWritten:8188
0 0
> seq 1 14118 | gzip -kc > foo.gz; python3 -u -c 'import rapidgzip; rapidgzip.cli()' -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73602
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408)
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, nBytesWritten: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8194)
[writeAllSpliceUnsafe] nBytesWritten:8192
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:2)
[writeAllSpliceUnsafe] nBytesWritten:-1
error: 32
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "rapidgzip.pyx", line 707, in rapidgzip.cli
RuntimeError: Failed to write to pipe
1 0

It looks a bit different when calling rapidgzip built with CMake (without the Python layer):

> seq 1 14118 | gzip -kc > foo.gz; src/tools/rapidgzip -kcd foo.gz | head -1 > /dev/null; echo ${PIPESTATUS[*]}
[writeAll] outputFileDescriptor: 1, dataToWriteSize: 73602
[writeAllSplice] outputFileDescriptor: 1, buffersToWrite count: 1
[SpliceVault::splice] buffersToWrite count: 1
[writeAllSpliceUnsafe] vmsplice( 1, i=0, segmentCount=1)
[writeAllSpliceUnsafe] nBytesWritten: 65408
[writeAllSpliceUnsafe] Write remaining iovec. Currently at i=0 out of 1, remaining bytes to write: 65408
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:8194)
[writeAllSpliceUnsafe] nBytesWritten:8192
[writeAllSpliceUnsafe] vmsplice( 1, dataToSplice.iov_len:2)
141 0
koriege commented 8 months ago

For example, you can do: src/tools/rapidgzip -kcd --export-index index --count foo.gz> /dev/null | head and it probably should still generate the full index and maybe even return the correct count on stderr.

I don't get your point here. This command will always generate the full index and terminates without receiving any signal by head, because head will never get any stdin. On the other hand, when the user is not redirecting to /dev/null, and head is going to kill the rapidgzip process, for sure also the index will be and should be corrupted as a result of what the user did.

The first call will work just fine, while the second or third call, depending on how much time has passed, will return the broken pipe error.

I see, but can't you simply exit the python main upon catching BokenPipeError and KeyboardInterrupt (SIGINT, or ctr+c)

mxmlnkn commented 8 months ago

I don't get your point here. This command will always generate the full index and terminates without receiving any signal by head, because head will never get any stdin. On the other hand, when the user is not redirecting to /dev/null, and head is going to kill the rapidgzip process, for sure also the index will be and should be corrupted as a result of what the user did.

My bad. Not sure where the >/dev/null came from. I meant simply to combine -c and --export-index, which is allowed. It will write the decompressed result to stdout and export the index after it has finished decompressing.

mxmlnkn commented 8 months ago

I have pushed a commit. You can try it out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'

Some further notes:

I am not that well-versed in POSIX but the exit code 141 is very likely shell-dependent because the program (gzip or even rapidgzip without the Python wrapper) exited via a signal without a valid return code. In that case, bash seems to set the return code to 128 + signal code. Python seems to overwrite the signal handler for SIGPIPE and effectively ignores it. Without the Python wrapper, the call to vmsplice on a broken pipe does not even return to the line of code thereafter. When the signal is ignored, it returns with errno set to EPIPE.

The current solution propagates the error and effectively returns with a similar exit code. Alternatively, I could have overwritten the signal handler but it feels wrong because who knows how hard Python depends on its overwritten signal handler for this. I'm not sure whether this difference can be observed from outside or whether the return code is the only observable. It seems like the wait syscall has this information but bash does not expose it and instead only constructs a special return code as stated above.

Some further interesting considerations for SIGPIPE. The argument about embracing lazy evaluation made me discard my argument with --export-index, i.e., an index will not be exported on a broken pipe. Currently, this is by design.

Personally, I have no strong opinion about this yet. I see that the backtrace is confusing and implies a more problematic error than is actually the case. Therefore, I'll only guarantee that this won't happen. As for what exact error code or the exact exit method being used, I'll leave this open to change in the future (this is not even a 1.0 version yet).

koriege commented 8 months ago

works like charm. thank you.

the exit code 141 is very likely shell-dependent

Actually not. As far as I know, the kernel sends SIGPIPE to the process which continues writing to a buffer, for which no readers exist any more. My guess is, that vmsplice receives the SIGPIPE and terminates by throwing EPIPE error, which bubbles up in the Python code and has to be catch there accordingly. Thought, I am not a Python developer..