ltgcgo / incubator

🥚 Where ideas come to life.
Creative Commons Attribution Share Alike 4.0 International
0 stars 0 forks source link

Brandy Breeze - Bzip respin #7

Open PoneyClairDeLune opened 7 months ago

PoneyClairDeLune commented 7 months ago

Brandy Breeze: 🍾 General-purpose compression algorithm.

Built on top of RLE, BWT, MTF(?) and LZ77/LZSS. Aims at offering a balance between Brotli's speed and XZ's efficiency.

Because bzip3 (.bz3) is already used as a name by others, the extension would probably be .brz. The stream format would be backwards compatible with bzip2, but either cannot compress or decompress the other.

General design:

  1. Run-length encode, 3-byte trigger. (?)
  2. Burrows-Wheeler transform.
  3. Move-to-front transform. (?)
  4. Run-length encode, 3-byte trigger. (?)
  5. LZ77/LZSS encode. If with LZSS, store 1-bit flags in KORG-like little-endian fashion.
PoneyClairDeLune commented 7 months ago

Current working draft: https://github.com/PoneyClairDeLune/brandy-breeze-js

PoneyClairDeLune commented 7 months ago
.magic:16                       = 'BZ' signature/magic number
.version:8                      = 'r' for Brandy Breeze
.hundred_k_blocksize:8          = '1'..'9', 'a'..'f' block-size 64 KiB-1 MiB (uncompressed)

.compressed_magic:32            = 'BZrd' for Brandy Breeze chunk data packet
.crc:32                         = checksum for this block (BE)
.size:24                        = size of this block (VLV 21-bit, BE)
.origPtr:24                     = start pointer into BWT for before untransform (VLV 21-bit, BE)
.data                           = compressed data

.eos_magic:32                   = 'BZre' for Brandy Breeze end of stream
.crc:32                         = checksum for whole stream

(Unfinished)

PoneyClairDeLune commented 4 months ago

Discovered a more space-efficient way of reversing BWT without ever building the supposed impossibly-humongous tables. Others might have already discovered this beforehand, but meh, I still spent some hours on this :P

  1. Create an array with all indexes from the input buffer array.
  2. Sort the index array with values from the input.
  3. Build a map/array with the sorted index array: map[e] = i
  4. Iterate through the map with the value gotten as the key of the next iteration, writing corresponding bytes from the input buffer to the output buffer.

Now onwards to a better BWT implementation...

PoneyClairDeLune commented 4 months ago

.brz operates at a maximum BWT block size of 1 MiB, so with the two index arrays (uint32), the space used to store the index maps would take about 8 MiB. An additional MiB is used to store the LZSS sliding window, with yet another additional MiB used to store the quick LZSS fast pointer lookup table (16bit + max 16 options).

The modified KORG encoding could prove quite useful here.

PoneyClairDeLune commented 4 months ago

Discovered a more space-efficient way of reversing BWT without ever building the supposed impossibly-humongous tables. Others might have already discovered this beforehand, but meh, I still spent some hours on this :P

1. Create an array with all indexes from the input buffer array.

2. Sort the index array with values from the input.

3. Build a map/array with the sorted index array: `map[e] = i`

4. Iterate through the map with the value gotten as the key of the next iteration, writing corresponding bytes from the input buffer to the output buffer.

Now onwards to a better BWT implementation...

https://github.com/senkevinli/burrows-wheeler-transform Yup, someone already discovered.