madler / zlib

A massively spiffy yet delicately unobtrusive compression library.
http://zlib.net/

Questions about Large gz file handling (distributed processing) #767

Closed siliu-tacc closed 1 year ago

siliu-tacc commented 1 year ago

Dear Mark and zlib developers,

I am working on some large gz uncompressed files right now (at least several hundred GB uncompressed file each) and I am trying to process them in a distributed (parallel) way. I have a few questions to confirm with you.

1) I read RFC 1952 online and see that each block starts with identifiers such as ID1 = 31 (0x1f, \037) and ID2 = 139 (0x8b, \213).

How did you choose these identifiers? I assume that the identifiers "1f8b" or "1f8b08" will not appear in the compressed block (for example, inside the Huffman codes), but I would like to confirm this and learn more about these choices, so that I can partition the large compressed files block by block without problems.

2) If these identifiers only appear in the header, I wonder whether I can process a large gz file block by block in a distributed/independent way. In your zran example, you go through the whole data file and build an index for it for future random access, which is great.

But building that index requires going through the whole file and inflating/decompressing it in a first pass. I wonder if it is possible to go through the file without decompressing it and only search for the block headers in the compressed file (e.g. "1f8b08") to find the start/end location of each block. Then I could work on each block and inflate them separately/independently.

Thank you so much, Si

madler commented 1 year ago

I think you mean "large gz compressed files".

To make sure we get our terminology straight, as noted in RFC 1952, what you are calling "blocks" are referred to as members. (RFC 1952 also refers to an entirely different thing called blocks, which are deflate blocks, described in more detail in RFC 1951.)

Yes, each member starts with 1f8b08. However, almost all gzip files consist of a single gzip member. Unless your large gzip files were created in a special manner, either by generating multiple members or by concatenating smaller gzip files, each of your files is a single member.

No, the sequence 1f8b08 most definitely can appear in the compressed data. It will appear by chance every 16MB or so, on average, with a Poisson distribution. Compressed data can be any sequence of bytes.
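The "every 16MB or so" figure can be checked with a few lines of arithmetic. This is only a sketch, and it assumes that compressed bytes behave like independent, uniformly distributed random bytes, which is an approximation; the function names are made up for illustration.

```python
import math

# Assumption: each compressed byte is uniform over 0..255 and independent
# of its neighbours, so a given 3-byte pattern matches at any offset with
# probability (1/256)**3 = 2**-24.
P_MATCH = (1 / 256) ** 3


def expected_gap_bytes() -> float:
    """Average distance between chance occurrences of a 3-byte pattern."""
    return 1 / P_MATCH


def expected_false_positives(compressed_size: int) -> float:
    """Expected number of chance matches in compressed_size bytes."""
    return compressed_size * P_MATCH


def prob_no_false_positive(compressed_size: int) -> float:
    """Poisson estimate of the chance that the pattern never appears."""
    return math.exp(-expected_false_positives(compressed_size))
```

Under that assumption, `expected_gap_bytes()` is 2**24 bytes (16 MiB), and a 100 GiB compressed file would be expected to contain several thousand chance occurrences of 1f8b08.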

Yes, if your file has many gzip members (which it probably doesn't), then you can search for 1f8b08, or for a longer sequence if you know something about the gzip headers in your files, and start decompressing from those points. You will get false positives, which should be detected as such in short order by the inflator (very likely within a few tens of kilobytes or less), and can then be discarded. The true positive members can be successfully decompressed independently from each other.
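The search-and-verify approach described above could be sketched roughly as follows, using Python's standard `zlib` module rather than the C API. The helper names `find_candidates` and `try_member` are hypothetical, and the sketch assumes the compressed data fits in memory; a real tool for files of this size would read and inflate in chunks.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # ID1, ID2, CM = 8 (deflate)


def find_candidates(buf: bytes) -> list:
    """Return every offset where the 3-byte gzip signature appears.

    Some of these offsets may be false positives inside compressed data.
    """
    offsets = []
    pos = buf.find(GZIP_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(GZIP_MAGIC, pos + 1)
    return offsets


def try_member(buf: bytes, offset: int):
    """Try to inflate one gzip member starting at offset.

    Returns (data, end_offset) if a complete member with a valid trailer
    was decoded, or None if the candidate is a false positive or truncated.
    """
    d = zlib.decompressobj(wbits=31)  # 16 + 15: expect a gzip wrapper
    try:
        data = d.decompress(buf[offset:])
    except zlib.error:
        return None  # bad header, invalid deflate data, or CRC mismatch
    if not d.eof:
        return None  # ran out of input before the member ended
    end = len(buf) - len(d.unused_data)
    return data, end
```

Each offset from `find_candidates` can be handed to a separate worker, and workers whose `try_member` call returns `None` simply drop their candidate. A candidate that decodes successfully has also passed the CRC-32 and length check in the gzip trailer.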

siliu-tacc commented 1 year ago

Hi Mark,

Thank you for the quick response and all the information. I have two more quick follow up questions.

1) I did see statements like "almost all gzip files consist of a single gzip member" before, but did not pay much attention to them. Is there any way to confirm how many members there are in each gzip file? I did not generate those large gz files myself and may not be able to find out how they were generated, but I guess there should be some way to confirm the number of members. I originally counted occurrences of the ID "1f 8b 08", but the results must already include the "false positive" ones...

2) If each gz file only consists of a single member, then there is nothing to do with the random access (to any specific member) as I originally planned. Do you have any recommendation (besides the building index way in your zran example) that I may consider to process a huge gz file more efficiently (like in a parallel/distributed way)?

Thank you so much, Si Liu

madler commented 1 year ago

  1. pigz -ltv giant.gz will show the individual members of the gzip file. If there is one line of output after the header, then there is one member.
  2. Either build an index, or recompress the file so that it has multiple members.
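
For illustration only, both suggestions can be approximated with Python's standard `zlib` and `gzip` modules. This is a sketch, not what pigz does internally: `count_members` and `recompress_multi_member` are hypothetical names, the counter assumes there is no padding or other trailing data after the last member, and the recompressor splits at fixed byte counts, which may cut a record in half.

```python
import gzip
import zlib


def count_members(path: str, chunk_size: int = 1 << 16) -> int:
    """Count gzip members by inflating the file from start to end.

    The decompressed output is discarded; only member boundaries matter.
    """
    members = 0
    with open(path, "rb") as f:
        d = None
        buf = f.read(chunk_size)
        while buf:
            if d is None:
                d = zlib.decompressobj(wbits=31)  # gzip wrapper expected
            d.decompress(buf)
            if d.eof:
                # Member finished; leftover bytes belong to the next one.
                members += 1
                buf = d.unused_data or f.read(chunk_size)
                d = None
            else:
                buf = f.read(chunk_size)
        if d is not None:
            raise ValueError("truncated gzip member at end of file")
    return members


def recompress_multi_member(src: str, dst: str,
                            member_size: int = 64 << 20) -> None:
    """Rewrite src as dst, one member per member_size uncompressed bytes."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            block = fin.read(member_size)
            if not block:
                break
            # Each gzip.compress() call emits one complete, independent member.
            fout.write(gzip.compress(block))
```

Counting still requires one full sequential pass over the file, but after recompression each member can be inflated on its own, at some cost in compression ratio because every member starts with an empty history window.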

siliu-tacc commented 1 year ago

Thanks a lot, Mark. Your help is greatly appreciated.

Best wishes, Si Liu