Closed by nschuessler 1 year ago
So it appears this is a multi-member gzip file and requires MultiGzDecoder.
Sorry for the late reply, and thanks for sharing!
We are currently working on improving the documentation around the usage of GzDecoder and MultiGzDecoder in the hopes that this will be less of a problem in future.
Closing, as this PR is not directly actionable.
In trying to decode the Common Crawl index files, GzDecoder stops at about 1.8 MB of input on a 690 MB file. The file is too large to use .read_to_end (i.e. read it into memory). If you download the file and use

gzip -d cdx-00010.gz

the whole file is expanded. How do you use GzDecoder to get the same behavior as gzip -d?

The code exits early because decoder.read returns 0 bytes, whereas reading from the underlying stream (input_stream.read) continues to return data. So I assume there is some format feature in the file that GzDecoder does not handle and gzip does. It prints 'Read 0 x' before exiting, so I assume there are no errors. Thanks
Example input: https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2023-06/indexes/cdx-00010.gz
Example code: