pieroxy / lz-string

LZ-based compression algorithm for JavaScript
MIT License

Is there a way to allow custom dictionaries? #169

Status: Closed (wll8 closed this issue 1 year ago)

wll8 commented 1 year ago

For example, add a parameter, dict. When its value is ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!#$%&()*+,./:;<=>?@[]^_`{|}~" (the final " is part of the set), the compressed output would use the base91 character set.

see: #90

It looks like base91, but it is not the standard base91 algorithm; it only uses that character set for its output.

When the value is hello lz, the compressed text would contain only combinations of those characters.

pieroxy commented 1 year ago

Hello. I am not sure what your question is or what you are trying to achieve. Could you maybe explain a bit more?

wll8 commented 1 year ago

What I am looking for is a function that converts a string so that the result contains only the specified characters.

Example: Pseudocode

lz.compress(`hello`, {
   map: `0123456789ABCDEF`,
});

The output of the above code is 68656C6C6F, which contains no characters outside the map.

Can lz-string provide a way to alleviate the size increase by using the lz algorithm during the conversion process?
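
To make the size increase concrete, here is a minimal sketch of the plain Base16 mapping with no compression involved. The helper name toHex is an assumption for illustration and is not part of lz-string; it also assumes every character of the input fits in one byte.

```javascript
// Hypothetical helper (not an lz-string API): map each character to two hex digits.
// Assumes every UTF-16 code unit in `text` is below 0x100.
function toHex(text) {
  return Array.from(text, (ch) =>
    ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

const encoded = toHex("hello");
console.log(encoded);                          // 68656C6C6F
console.log(encoded.length / "hello".length);  // 2
```

Restricting the output to 16 symbols doubles the length here, which is the growth a compression step would need to offset.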

Here are some more examples:

pseudocode: looks like morse code

lz.compress(`hello`, {
  map: `. -`,
});

// .... . .-.. .-.. ---

pseudocode: looks like Base16

lz.compress(`hello`, {
  map: `0123456789ABCDEF`,
});

// 68656C6C6F

pseudocode: Array

lz.compress(`hello`, {
  map: ["公正", "爱国", "平等", "诚信", "文明"],
});

// 公正爱国公正平等公正诚信文明公正诚信文明公正诚信平等

pseudocode: emoji

lz.compress(`hello`, {
  map: `👨👩👧`,
});

// 👨👨👧👨👨👧👨👧👩👨👧👨
pieroxy commented 1 year ago

Can lz-string provide a way to alleviate the size increase by using the lz algorithm during the conversion process?

This is exactly what LZ-String does. It uses the LZW algorithm to compress the input, then uses the symbols in the array to store the bits the algorithm produced. So in effect, the result of lz.compress('hello', {map: '0123456789ABCDEF'}); would not merely look like base16: it would be a base16 representation of the compressed stream.
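
The idea of storing the compressor's bits in symbols can be sketched as follows. This is only an illustration, not lz-string's actual code: packBits is a hypothetical helper, the bit array stands in for the compressor's output, and the alphabet is assumed to hold at least 2^bitsPerChar single-character symbols.

```javascript
// Hypothetical helper: pack an array of bits into symbols taken from `alphabet`.
function packBits(bits, alphabet) {
  // Integer floor(log2(length)): how many whole bits one symbol can carry.
  const bitsPerChar = 31 - Math.clz32(alphabet.length);
  let out = "";
  for (let i = 0; i < bits.length; i += bitsPerChar) {
    let value = 0;
    for (let j = 0; j < bitsPerChar; j++) {
      // Most significant bit first; missing trailing bits are padded with 0.
      value = (value << 1) | (bits[i + j] || 0);
    }
    out += alphabet.charAt(value);
  }
  return out;
}

const bits = [0, 1, 1, 0, 1, 0, 0, 0];
console.log(packBits(bits, "0123456789ABCDEF")); // 68
console.log(packBits(bits, "12AB"));             // 2AA1
```

The same bit stream yields two symbols with a 16-symbol alphabet and four symbols with a 4-symbol alphabet, so a smaller alphabet always costs more output characters.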

wll8 commented 1 year ago

You have provided the following functions, but none of them meet the requirements, because in my use case only specified characters are allowed to appear in the compression result.

Transmission of characters other than 12AB is not allowed in my use case.

Suppose I need to compress BBC, without lz-string, I might do this:

var userDict = `12AB`
var split = userDict.slice(0, 1)
var sysDict = userDict.slice(1)
var table = {
  A: sysDict.repeat(1), // 2AB
  B: sysDict.repeat(2), // 2AB2AB
  C: sysDict.repeat(3), // 2AB2AB2AB
}
var table2 = Object.entries(table).reduce((acc, [key, val]) => (acc[val]= key, acc), {})

// the string to be compressed
var str = `BBC`

// ['2AB2AB', '2AB2AB', '2AB2AB2AB'].join('1')
var compressed = str.split(``).map(str => table[str]).join(split)
// compressed = '2AB2AB12AB2AB12AB2AB2AB'

var decompressed = compressed.split(split).map(str => table2[str]).join(``)

With this approach I managed to keep the encoded result within the characters 12AB, but it does not compress the data. How can lz-string help here?

pieroxy commented 1 year ago

Right, sorry about that, I thought you already looked into the code since you described exactly the way LZString works. If you look at the implementation of compressToBase64, you see this line:

LZString._compress(input, 6, function(a){return keyStrBase64.charAt(a);})

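The _compress function takes three arguments: the input string, the number of bits each output character carries, and a function that maps an integer of that many bits to an output character.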

So in your case you would call:

LZString._compress(input, 2, function(a){return "12AB".charAt(a);})

To do it in hexadecimal:

LZString._compress(input, 4, function(a){return "0123456789ABCDEF".charAt(a);})

Edit: Ah, and there is a corresponding _decompress function obviously. Let me know if you need any more assistance with that one.
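
The two _compress calls above differ only in the bit count and the alphabet, so both arguments could be derived from the alphabet itself. A minimal sketch, assuming the alphabet consists of single UTF-16 code units; compressArgsFor is a hypothetical helper, not an lz-string API. It rounds the bit count down because the callback receives values from 0 to 2^bitsPerChar - 1, and charAt returns an empty string for an index past the end of the alphabet.

```javascript
// Hypothetical helper (not an lz-string API): derive the 2nd and 3rd
// arguments of _compress from a custom alphabet.
function compressArgsFor(alphabet) {
  // Integer floor(log2(length)), so every possible value has a symbol.
  const bitsPerChar = 31 - Math.clz32(alphabet.length);
  return {
    bitsPerChar,
    getCharFromInt: (value) => alphabet.charAt(value),
  };
}

const hexArgs = compressArgsFor("0123456789ABCDEF");
console.log(hexArgs.bitsPerChar);                 // 4
console.log(compressArgsFor("12AB").bitsPerChar); // 2

// With lz-string loaded, the call would then look like:
// LZString._compress(input, hexArgs.bitsPerChar, hexArgs.getCharFromInt);
```

With an alphabet whose length is not a power of two, such as the 91-character set from the first comment, this sketch would use 6 bits and leave the symbols past index 63 unused.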

wll8 commented 1 year ago

I saw the _compress method here a few days ago. Since the author raised some issues with transferring data, I thought _compress was not the method I wanted. Now it looks like I was wrong.

Also, I think _compress is like a treasure trove. It is worth showing in the README or on the homepage.

I don't yet understand why a few extra parameters need to be passed (could they have preset values?), or why _decompress needs charAt or charCodeAt, but I will try to understand it first.

Thanks for reminding me to use _compress again.

pieroxy commented 1 year ago

I am the author. It's compress that has issues because it generates invalid UTF-16 characters that some JS engines fail at storing and retrieving. _compress is the one doing the real job.

pieroxy commented 1 year ago

I don't yet understand why a few extra parameters need to be passed (could they have preset values?), or why _decompress needs charAt or charCodeAt, but I will try to understand it first.

The function you pass as an argument to _decompress has one job: Providing the meaningful bits in the input string. If it's hexadecimal, it needs to give out 0 for '0', 1 for '1' etc up to 15 for 'F'. That's its job, so it has to read the input string, hence the charAt. After that, you can do a switch, a bunch of if, a dictionary lookup (as in the decompressFromBase64 function) etc. It's really up to you.
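
The dictionary-lookup variant described above could be sketched like this. It assumes, as in decompressFromBase64, that the callback is handed a position in the input string and must return the numeric value of the symbol found there. makeGetNextValue is a hypothetical name, and the alphabet is assumed to consist of single UTF-16 code units.

```javascript
// Hypothetical helper: build the callback that turns input symbols back into values.
function makeGetNextValue(alphabet, input) {
  // Reverse dictionary: symbol -> its position in the alphabet.
  const valueOf = new Map(Array.from(alphabet, (ch, i) => [ch, i]));
  return (index) => valueOf.get(input.charAt(index));
}

const nextValue = makeGetNextValue("0123456789ABCDEF", "68F");
console.log(nextValue(0)); // 6
console.log(nextValue(1)); // 8
console.log(nextValue(2)); // 15
```

A Map keeps each lookup constant-time; for a short alphabet, alphabet.indexOf(input.charAt(index)) would do the same job.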

HelloLudger commented 11 months ago

Hey, I am trying to add a custom compress/decompress function, but I fail to understand the number that _decompress needs. For compress, the number of bits can be calculated as Math.ceil(Math.log2(dict.length)). But how do I calculate the resetValue? Math.ceil(dict.length/2)? dict should be a string, e.g. "0123456789ABCDEF".
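
The thread leaves this question open. Judging from lz-string's own wrappers, which pass 32 to _decompress for the 6-bit Base64 alphabet and 16384 for the 15-bit UTF-16 variant, resetValue appears to be the mask of the highest bit of a symbol, that is 2^(bitsPerChar - 1). The sketch below works under that assumption; decompressArgsFor is a hypothetical helper, not an lz-string API, and it rounds the bit count down rather than up so that every value has a symbol in the alphabet.

```javascript
// Hypothetical helper (not an lz-string API): derive bitsPerChar and
// resetValue from a custom alphabet.
function decompressArgsFor(alphabet) {
  // Integer floor(log2(length)).
  const bitsPerChar = 31 - Math.clz32(alphabet.length);
  // Assumption: resetValue is the mask of the top bit of one symbol.
  const resetValue = 1 << (bitsPerChar - 1);
  return { bitsPerChar, resetValue };
}

const hexParams = decompressArgsFor("0123456789ABCDEF");
console.log(hexParams.bitsPerChar, hexParams.resetValue); // 4 8
console.log(decompressArgsFor("12AB").resetValue);        // 2

// With lz-string loaded, decompression might then look like:
// LZString._decompress(text.length, hexParams.resetValue,
//   (index) => "0123456789ABCDEF".indexOf(text.charAt(index)));
```

For alphabets whose length is a power of two, this gives the same number as dict.length / 2; the two only diverge for other lengths, such as 91 symbols.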