mxmlnkn / rapidgzip

Gzip Decompression and Random Access for Modern Multi-Core Machines
Apache License 2.0

s3 stdin support #34

Closed: tooptoop4 closed this issue 4 months ago

tooptoop4 commented 4 months ago

This works: aws s3 cp s3://redact/redact.csv.gz - | gunzip -c | wc -l

This fails with the error "Either stdin must have input, e.g., by piping to it, or an input file must be specified!": aws s3 cp s3://redact/redact.csv.gz - | rapidgzip -d -P 1

mxmlnkn commented 4 months ago

Which version are you using (rapidgzip --version)? I think I fixed some issues with stdinHasInput in more recent versions. I currently don't have a test URI for aws s3, but it works fine with cat file.gz | rapidgzip -d -P 1. Does it work with cat for you, i.e., is it really an issue only with aws s3?

mxmlnkn commented 4 months ago

Ok, I remember the issue I had recently, and it also applies here. The problem is that stdinHasInput has no surefire way to detect a pipe to another program and simply waits for input with a timeout. I can reproduce your error with:

( sleep 1; cat base64-8MiB.gz ) | rapidgzip -d -P 1

This works with gzip, so I could take a look at how piped input is detected there.
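The race described above can be illustrated with a minimal Python sketch. It is not rapidgzip's C++ code: `has_input_within` and `delayed_write` are hypothetical helpers, and the sketch assumes a POSIX system where `select.poll` is available. A producer that starts writing after the timeout has expired is misjudged as "no input", just like the `sleep 1` reproduction.

```python
import os
import select
import threading
import time


def has_input_within(fd: int, timeout_ms: int) -> bool:
    """Heuristic: report input only if fd becomes readable within timeout_ms."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return bool(poller.poll(timeout_ms))


def delayed_write(fd: int, delay_s: float, payload: bytes) -> None:
    """Simulate a slow producer, such as a download that takes time to start."""
    time.sleep(delay_s)
    os.write(fd, payload)
    os.close(fd)


read_fd, write_fd = os.pipe()
writer = threading.Thread(target=delayed_write, args=(write_fd, 0.5, b"data"))
writer.start()

# The producer needs 0.5 s, but the check gives up after 100 ms.
early = has_input_within(read_fd, 100)

writer.join()
# Once the data has arrived, the same check succeeds.
late = has_input_within(read_fd, 100)
os.close(read_fd)

print(early, late)
```

Any fixed timeout has this weakness: whatever value is chosen, a sufficiently slow producer will exceed it.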

mxmlnkn commented 4 months ago

Here is a workaround: rapidgzip -d -P 1 -c <( sleep 1; cat file.gz ). Note that parallelization should also work with this (and with pipes) since rapidgzip 0.11.0.

tooptoop4 commented 4 months ago

I'm on the latest version and only see the issue with s3.

mxmlnkn commented 4 months ago

I have pushed a fix. stdinHasInput now uses isatty instead of poll with a timeout of 100 ms. You can try it out with:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'

Or by building from source with CMake. Note that the up-to-date, unreleased state is only available at https://github.com/mxmlnkn/indexed_bzip2 .
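For illustration, the idea behind the isatty-based check can be sketched in Python. This is only an analogue under assumptions, not the actual C++ change: `stdin_has_input` is a hypothetical name mirroring stdinHasInput, and the sketch simply treats every descriptor that is not a terminal as input, without waiting for data.

```python
import os
import tempfile


def stdin_has_input(fd: int) -> bool:
    """Treat any non-terminal descriptor (pipe, redirected file) as input."""
    return not os.isatty(fd)


# A pipe with no data written yet still counts as input, so a slow
# producer is no longer rejected before it has sent its first byte.
read_fd, write_fd = os.pipe()
empty_pipe_ok = stdin_has_input(read_fd)
os.close(read_fd)
os.close(write_fd)

# A regular file, as used with shell redirection, is not a terminal either.
with tempfile.TemporaryFile() as f:
    file_ok = stdin_has_input(f.fileno())

print(empty_pipe_ok, file_ok)
```

The check no longer depends on timing: it answers immediately and leaves waiting for the first bytes to the subsequent blocking read.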

tooptoop4 commented 4 months ago

when will this be released? @mxmlnkn

mxmlnkn commented 4 months ago

> when will this be released? @mxmlnkn

It is released in rapidgzip 0.10.5. I did not do a bugfix release for versions 0.11 and 0.12. If you want to use it now, you can downgrade to 0.10.5 or install from source:

python3 -m pip install --force-reinstall 'git+https://github.com/mxmlnkn/indexed_bzip2.git@master#egginfo=rapidgzip&subdirectory=python/rapidgzip'

I didn't really want to do a bugfix release for 0.12; I wanted to include the fix in 0.13 instead, but that may take a few more weeks. I guess if you really want a released version, I'll have to do a 0.12.2 release.

mxmlnkn commented 4 months ago

Release 0.13.0 is out.

tooptoop4 commented 4 months ago

Is there a changelog for the release? I can't see the tag in this repo.

mxmlnkn commented 4 months ago

https://github.com/mxmlnkn/rapidgzip/releases/tag/rapidgzip-v0.13.0