kubo / snzip

Snzip, a compression/decompression tool based on snappy

Question about matches in snzip tool #30

Open shulib opened 1 year ago

shulib commented 1 year ago

Hi,

Do you support window size for match offset > 64k when packet is greater? What are the parameters I should insert to do that? I am running snzip version 1.0.4 in the framing2 and framing modes.

Thanks.

shulib commented 1 year ago

Should I use snzip to compress when the window size (match offset?) is greater than 64k? If so, how? Thanks.

kubo commented 1 year ago

I'm not sure what you mean by window size. If you mean kBlockSize, you should ask on the Google snappy mailing list.

shulib commented 1 year ago

I created the following file to compress:

part a: 64k bytes of random digits, for example 71451745376545378
part b: 128k bytes of a single repeated digit
part c: 64k bytes, copied from part a

After compressing it with the tool, I expected the start of part c to be encoded as a match with offset 192k. But when I compared the compressed output for part c with the compressed output for part a, they were the same.
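
For anyone who wants to reproduce this layout, here is a minimal Python sketch. It is an illustration only, not the procedure actually used in this thread: the function name `build_test_data`, the fixed seed, and the choice of `1` as the repeated digit are all assumptions.

```python
import random

KIB = 1024


def build_test_data(seed: int = 0) -> bytes:
    """Build part a + part b + part c as described above (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    # part a: 64 KiB of random decimal digits
    part_a = "".join(rng.choice("0123456789") for _ in range(64 * KIB)).encode("ascii")
    # part b: 128 KiB of one repeated digit (the digit itself is an assumption)
    part_b = b"1" * (128 * KIB)
    # part c: an exact copy of part a, so it starts at offset 192 KiB
    part_c = part_a
    return part_a + part_b + part_c


def write_test_file(path: str) -> None:
    """Write the test data to the given path."""
    with open(path, "wb") as f:
        f.write(build_test_data())
```

Calling `write_test_file("out.txt")` would produce a 256 KiB file whose last 64 KiB repeat its first 64 KiB.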

kubo commented 1 year ago

Could you give a concrete explanation? Did you compress a single file containing the three parts? Could you post your data to a gist and post what you did here (the commands you executed, not a description in words)?

shulib commented 1 year ago

OK, here is the file: out.txt

I took this file and ran snzip.exe -t framing2 -k out.txt. I translate out.txt.sz to hex format and compared the blocks new line per digit - expected to match in from character number 61066 and see that you implemented it as literals section.

shulib commented 1 year ago

character on snzip file

kubo commented 1 year ago

I don't understand your question yet. Your explanation is unclear.

I translate out.txt.sz to hex format

I follow you up to here. You did something similar to the following command:

od -t x1z out.txt.sz > out.txt.sz.hex  # od is a command line tool on linux

and compared the blocks new line per digit - expected to match in from character number 61066 and see that you implemented it as literals section.

I'm not sure what you did. Could you post what you expected with more details and what you see? What is "literals section"?

shulib commented 1 year ago

I expected the first sequence of the last block to be a snappy match sequence:

match mode: 3
match offset: 192k
match length: 0x40

Instead, I get a literal sequence: 0xf4 ...

kubo commented 1 year ago

  1. Could you post what you see with your own eyes, not what you interpreted? Without it, I cannot understand your interpretation. Even when you and I see the same thing, we may interpret it differently.
  2. Could you post the out.txt.sz compressed by your snzip? If the snappy library version used by your snzip is different from mine, the output may differ slightly.

I'd like you to post something similar to the following. If you cannot copy and paste the hex dump as text, paste images instead.

Head of the hex data dumped by od -t x1z -A x out.txt.sz > out.txt.sz.hex:

000000 ff 06 00 00 73 4e 61 50 70 59 00 57 d6 00 11 1c  >....sNaPpY.W....<
000010 33 c7 80 80 04 f4 8d 0a 39 30 32 35 39 33 35 31  >3.......90259351<
000020 35 35 39 39 39 33 37 33 32 38 35 35 39 35 38 31  >5599937328559581<
000030 32 37 37 37 32 32 31 36 36 37 39 32 35 32 39 31  >2777221667925291<
000040 36 33 39 35 39 33 30 30 34 33 38 35 32 38 33 33  >6395930043852833<

Lines 3815-3820 of out.txt.sz.hex (byte offsets 0x00ee60 - 0x00eebf of out.txt.sz):

00ee60 00 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fe 01 00  >................<
00ee70 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fa 01 00 00  >................<
00ee80 57 d6 00 11 1c 33 c7 80 80 04 f4 8d 0a 39 30 32  >W....3.......902<
00ee90 35 39 33 35 31 35 35 39 39 39 33 37 33 32 38 35  >5935155999373285<
00eea0 35 39 35 38 31 32 37 37 37 32 32 31 36 36 37 39  >5958127772216679<
00eeb0 32 35 32 39 31 36 33 39 35 39 33 30 30 34 33 38  >2529163959300438<

I interpreted it as:

The stream identifier (chunk type 0xff) starts at offset 0x000000. Its chunk data size is 0x000006, so the total chunk size is 4 + 0x000006 = 0x00000a. The first compressed data chunk (chunk type 0x00) therefore starts at 0x00000a. Its chunk data size is 0x00d657.

The first chunk of part c starts at 0x00ee7f. It is a compressed data chunk. The subsequent bytes look the same as those of the first compressed data chunk at 0x00000a.

00ee70 fe 01 00 fe 01 00 fe 01 00 fe 01 00 fa 01 00 00  >................<
                                                    ``-- part c starts here.
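
The interpretation above can be checked with a small Python sketch. It assumes the chunk layout of the snappy framing format (a 1-byte chunk type, a 3-byte little-endian data length, then the chunk data). The helper name `parse_chunk_header` is hypothetical, and the bytes are copied from the head of the dump.

```python
def parse_chunk_header(buf: bytes, pos: int):
    """Return (chunk_type, data_len, next_pos) for the chunk starting at pos."""
    chunk_type = buf[pos]
    # 3-byte little-endian length of the chunk data that follows the header
    data_len = int.from_bytes(buf[pos + 1:pos + 4], "little")
    # next chunk starts after the 4-byte header plus the chunk data
    return chunk_type, data_len, pos + 4 + data_len


# First 18 bytes of out.txt.sz, copied from the dump above.
head = bytes.fromhex("ff 06 00 00 73 4e 61 50 70 59 00 57 d6 00 11 1c 33 c7")

stream_id = parse_chunk_header(head, 0)               # stream identifier chunk
first_data = parse_chunk_header(head, stream_id[2])   # first compressed data chunk

print([hex(v) for v in stream_id])   # type 0xff, size 0x6, next chunk at 0xa
print([hex(v) for v in first_data])  # type 0x0, size 0xd657
```

In a compressed data chunk, the first 4 bytes of the chunk data (11 1c 33 c7 here) are the masked CRC-32C of the uncompressed data, and the snappy block itself starts after them.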

shulib commented 1 year ago

See line 00ee80: the block starts at byte 7 with 80 80 04 (part c starts). After that you get f4 8d 0a 39 30 32 ..., which is the same as line 000010 from byte 5. You get the same bytes, so why is it not a match sequence?
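
To make the bytes under discussion concrete, here is a minimal Python sketch that decodes the block preamble and the first element tag, following the snappy format description (varint uncompressed length, then tagged elements). The function names are hypothetical, and the last example is a hand-built tag for the expected match, not data from the file.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def describe_element(buf: bytes, pos: int):
    """Describe the element whose tag byte is at pos (sketch, no bounds checks)."""
    tag = buf[pos]
    kind = tag & 0x03  # low two bits select the element type
    if kind == 0:  # literal
        n = tag >> 2
        if n < 60:
            return ("literal", n + 1)
        extra = n - 59  # 60..63 means 1..4 length bytes follow
        return ("literal", int.from_bytes(buf[pos + 1:pos + 1 + extra], "little") + 1)
    if kind == 1:  # copy with 1-byte offset (11-bit offset, length 4..11)
        return ("copy1", ((tag >> 2) & 0x07) + 4, ((tag >> 5) << 8) | buf[pos + 1])
    width = 2 if kind == 2 else 4  # copy with 2-byte or 4-byte offset
    offset = int.from_bytes(buf[pos + 1:pos + 1 + width], "little")
    return ("copy2" if kind == 2 else "copy4", (tag >> 2) + 1, offset)


block = bytes.fromhex("80 80 04 f4 8d 0a")  # bytes at 0x00ee87 in the dump
uncompressed_len, pos = read_varint(block, 0)
print(uncompressed_len)                                  # 65536
print(describe_element(block, pos))                      # ('literal', 2702)
print(describe_element(bytes.fromhex("fe 01 00"), 0))    # ('copy2', 64, 1)
# Hand-built tag for the expected match: mode 3, length 0x40, offset 192k.
print(describe_element(bytes.fromhex("ff 00 00 03 00"), 0))  # ('copy4', 64, 196608)
```

So the block declares 64k of uncompressed data and opens with a 2702-byte literal, not with a copy element.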

kubo commented 1 year ago

I finally understand your question.

Do you support window size for match offset > 64k when packet is greater? what are the parameters I should insert to do that

There are no such parameters. The snappy library divides input data into 64k blocks (1). Each block is compressed separately (2). Byte sequences in a block cannot be encoded as matches against data in previous blocks.

1: https://github.com/google/snappy/blob/1.1.9/snappy.cc#L1477-L1529
2: Table for compression is cleared for each block. https://github.com/google/snappy/blob/1.1.9/snappy.cc#L713
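
As a rough illustration of why part c cannot refer back to part a, here is a small Python sketch of the block arithmetic. It assumes a block size of 1 << 16 bytes (the "64k" above) and that a match is only possible within one block. The names `BLOCK_SIZE`, `block_index`, and `same_block` are made up for this example and are not snappy APIs.

```python
BLOCK_SIZE = 1 << 16  # assumption: the 64k block size mentioned above
MAX_IN_BLOCK_OFFSET = BLOCK_SIZE - 1  # largest back-reference inside one block


def block_index(offset: int) -> int:
    """Index of the independently compressed block containing this byte offset."""
    return offset // BLOCK_SIZE


def same_block(src: int, dst: int) -> bool:
    """True if a match at dst could, in principle, refer back to src."""
    return src < dst and block_index(src) == block_index(dst)


part_a_start = 0
part_c_start = 192 * 1024  # part a (64k) + part b (128k)

print(block_index(part_a_start), block_index(part_c_start))  # 0 3
print(same_block(part_a_start, part_c_start))                # False
print(hex(MAX_IN_BLOCK_OFFSET))                              # 0xffff
```

Under these assumptions part a sits in block 0 and part c in block 3, so the only possible encoding for the start of part c is a literal, and every in-block offset fits in 16 bits.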

To increase the block size, you would need to change not only snzip but also snappy itself, so that it can handle offsets wider than 16 bits, as described here.