It's currently just base64-encoded with a URL-safe encoding. I don't have the bandwidth to devote to this right now, but I'd welcome contribution.
I should note that a solution should probably not assume anything based on Uiua's glyphs, as the glyphset is still in flux.
Well, it happens that I'm into compression algorithms as a hobby and have a small personal toolkit of "compression building blocks" for JS-based experiments, so I could distill something out of that without too much mental puzzling. I just need to find the time.
Basically, I have the mental bandwidth but not the "work hours" available for a full PR. Would writing a specialized compression/decompression function in JavaScript be a useful contribution? With comments so you can make sense of it. And then leave the part of wiring them up into the rest of the website up to you? Also even writing those functions might take a bit.
> I should note that a solution should probably not assume anything based on Uiua's glyphs, as the glyphset is still in flux.
Oh that will not be an issue. I'm thinking of a simple streaming dictionary compression that starts with an empty dictionary. It will grow as the string is processed and new glyphs are encountered, having no knowledge about which alphabet is in the string before doing so. Will spare you the details for now because bandwidth - guess you can read them as code comments later if those JS functions would be worth writing up.
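To give a flavour of what I mean, here's a very rough sketch (token-level only, with placeholder names; the real functions would pack the indices into bits):

```js
// Sketch only: emits an array of tokens instead of a packed bitstream.
// The dictionary starts empty and grows as new characters are encountered,
// so nothing about Uiua's glyph set is assumed up front.
function compressSketch(input) {
  const dict = new Map(); // char -> dictionary index
  const out = [];
  for (const ch of input) {
    if (dict.has(ch)) {
      out.push({ index: dict.get(ch) }); // known character: emit its index
    } else {
      out.push({ literal: ch });         // unseen character: emit a literal...
      dict.set(ch, dict.size);           // ...and add it to the dictionary
    }
  }
  return out; // a real version would pack each index into ~log2(dict.size) bits
}
```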
> Would writing a specialized compression/decompression function in JavaScript be a useful contribution?
Because the frontend is written in Rust, I'm not actually sure how easy it would be to integrate arbitrary JS functions.
Ah, I have never written anything in Rust. Just getting over that threshold would already slow me down quite a bit :sweat_smile:. Well, I'll do the JS functions first and worry about that later.
@JobLeonard If you ever finish that but don't have time for more, ping me! I'd be interested in translating the JS to Rust. I use TS/JS at my day job and have only used Rust for a few projects so far, this seems like an interesting challenge and a good learning experience.
@iFreilicht awesome! Looking forward to collaborating.
Some mental notes for myself after thinking about what would be the most cost-effective "uiua-oriented" compression approach (before I forget my conclusions again):
When a `#` is encountered, switch to an LZ78 variant (perhaps even a separate dictionary) until a newline character is encountered; otherwise just use a simple "encountered glyph" dictionary.

A good case study might be this ray tracer posted by Bren077s on the Discord.
The link is very long and, I quote:
> Had to trim some features and all comments to make the url small enough for discord to accept it.
You can see there are a few repeated glyph pairs, for example 11 times `⊙(` or 7 times `+×`. Binding names are also repeated quite often, so simple dictionary approaches to reduce the byte count per symbol should be quite effective, as you said.
I did some local testing of compression (`gzip` (LZ77), `bzip2`, `xz` (LZMA), `zstd`, `brotli`) on the latest version of my ray tracer source. All algorithms and compression levels performed about the same. Compression on my ray tracer varied from 1.79x to 2.06x. The best compressor at the minimum compression level was `bzip2` with 1.97x. The best compressor at the standard compression level (6) was `gzip` with 2.01x. The best compressor at the maximum compression level was `brotli` with 2.06x.

I then did the exact same test again, but removed the comments and most of the extra empty lines, so overall removed some redundancy and English text. Compression varied from 1.67x to 1.95x. The best compressor at the minimum compression level was `bzip2` with 1.67x. The best compressor at the standard compression level (6) was `gzip` with 1.92x. The best compressor at the maximum compression level was `brotli` with 1.95x.

Overall, it feels like the exact compression algorithm does not matter too much. In general, we may be able to get a 1.5x to 2x-ish size reduction on larger text.
Context: consider these:

Did you consider testing compression on ascii-uiua & comparing it with the compression results on glyphed-uiua? I am interested to see if, in the case of compressing ascii-uiua, it would be compensated by:

3. future optimisation scope: a compression-targeted formatter (a way to convert glyphs back into names, #60) to output a "minimal-length unambiguous head" of each ascii name/symbol
edit: okay, considering 3-byte unicode:
i'd like to expand on this:
%*`
!<=>
id
pi
,# unambiguous to human too
not abs sqr sin
mod pow log min max
len rev box sel rot
win mem rep dip gap
inv try tau eta inf
# not so unambiguous to humans
ind row dis tab
flo cei rou ata sha
fir des bit tra ris fal
cla ded unb mat cou joi
pic tak dro kee fol eac
cro gro par bot bra for
und lev fil ass spa wai
bre rec
@goyalyashpal
> unicode takes 4 times as much as ascii for a single character (right?)
The prevalent encoding these days is UTF-8, which is a variable-length encoding. In this encoding, a single codepoint can be up to 4 bytes long, but for the majority this is not the case. Most glyphs in Uiua are between code points U+0800 and U+FFFF, so they take exactly 3 bytes to encode.
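You can check this directly in JS, assuming an environment with `TextEncoder` (any modern browser or Node):

```js
// UTF-8 byte length of a string
const utf8Bytes = s => new TextEncoder().encode(s).length;

console.log(utf8Bytes("a"));  // 1 — ASCII
console.log(utf8Bytes("⊙")); // 3 — U+2299, like most Uiua glyphs
console.log(utf8Bytes("😀")); // 4 — outside the Basic Multilingual Plane
```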
> Most glyphs in Uiua are between code points U+0800 and U+FFFF
~~a more "strict" or tight range (for now at least) seems to be U+2190 to U+23FF, i.e. "Arrows", "Mathematical Operators", "Miscellaneous Technical"~~
Ok, I really appreciate the effort and how everyone is thinking about this, but this has basically been solved by people smarter than us a couple of decades ago :sweat_smile:.
To explain why this isn't going to be an issue basically requires a partial crash course in the essence behind the LZ78 family of compressors, please bear with me (:sweat_smile:, again).
First, a crucial mathematical insight: to write down a (positive, non-zero integer) number n in binary, you only need `floor(log2(n)) + 1` bits:
number (base ten) | number (binary) |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
7 | 111 |
8 | 1000 |
15 | 1111 |
16 | 10000 |
31 | 11111 |
32 | 100000 |
... | ... |
(Tangent: the log-scaling behavior is a result of positional notation and true for any base (yes, that includes negative bases))
So when we know the difference between the minimum and maximum value in a list of numbers, we can minimize how many bits we need to store them. Example: if all values are integers in the range [0, 255], we only need 8 bits per number. You probably do this kind of optimization in your own programs regularly. (Advanced trick: when the number range isn't a neat power of two, you can use encodings like phase-in codes to get close to `log2(n)` bits for a uniform distribution of numbers (see also here for an attempt by me at explaining phase-in codes).)
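To make the phase-in idea concrete, here's a tiny illustrative sketch in JS (it just returns bit strings, it's not an actual bit writer):

```js
// Phase-in / truncated binary code for values 0..n-1, where n need not be a power of two.
// Short codewords get k bits, the rest get k+1 bits, averaging close to log2(n).
function phaseInEncode(x, n) {
  const k = Math.floor(Math.log2(n));
  const u = (1 << (k + 1)) - n; // how many values get the short k-bit codeword
  return x < u
    ? x.toString(2).padStart(k, "0")            // short codeword
    : (x + u).toString(2).padStart(k + 1, "0"); // long codeword
}

// Example: an alphabet of 5 symbols needs between 2 and 3 bits per symbol
console.log([0, 1, 2, 3, 4].map(x => phaseInEncode(x, 5)));
// -> ["00", "01", "10", "110", "111"]
```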
Look closely at any (lossless) compression scheme out there, and you will realize that this is actually a crucial implicit building block in all of them. It always boils down to "how can we map the patterns in this data to number representations that give us that sweet, sweet log(n) scaling behavior". Weirdly enough almost no data compression book bothers to mention that. (Tangent: it might be because Arabic numerals are positional notation base 10. So basically every literate person "knows" the most fundamental lossless data compression method without realizing it, and maybe that's why it's taken for granted)
Now to turn this insight into a data compression scheme.
One approach is to create a dictionary (really just a look-up table) that only contains the characters known to be in our input string. Bren077s's ray tracer has only 81 unique characters, for example. With a dictionary containing only those characters, we could compress the string by storing the index of each character, which needs roughly 6.4 bits per character with phase-in codes (ignoring the actual distribution of characters).
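That estimate is easy to reproduce (the 81 comes from Bren077s's source, so the exact number will vary per snippet):

```js
// Hypothetical check: unique characters in a source string and bits per index
const source = "/* paste the ray tracer source here */";
const uniqueChars = new Set(source).size;
console.log(uniqueChars);                       // 81 when run on the ray tracer source
console.log(Math.log2(uniqueChars).toFixed(2)); // ~6.34 bits per index with an ideal code
console.log(Math.ceil(Math.log2(uniqueChars))); // 7 bits with plain fixed-width indices
```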
You might ask "ok, but then we need to send over the dictionary, which also takes up space, no?"
Here is how to work around that: start with a dictionary containing one entry: a special `new character` escape token. If we are decoding our compressed bits and encounter it, we know that we have to read the next n bits as a character literal (n depends on the character encoding and the character), add the character to the reconstructed string and to our dictionary, and from then on use its dictionary entry. So on top of sending over each literal once, we only pay an overhead cost of one escape token.
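In the same token-level sketch as before, the decoding side would look roughly like this (the bit-level reading of literals and indices is left out):

```js
// Decoder sketch: tokens are either { literal: ch } (the "new character" escape
// followed by the literal) or { index: i } (a reference into the dictionary).
function decompressSketch(tokens) {
  const dict = []; // index -> character, grows exactly like the encoder's dictionary
  let out = "";
  for (const t of tokens) {
    if ("literal" in t) {
      out += t.literal;     // escape: take the character literal...
      dict.push(t.literal); // ...and add it to the dictionary for later reuse
    } else {
      out += dict[t.index]; // known character: look it up
    }
  }
  return out;
}
```

With the earlier `compressSketch`, `decompressSketch(compressSketch(s)) === s` should hold for any string, which is the round-trip property the real bit-packed version also has to satisfy.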
The LZ78 algorithm then expands on this concept by not limiting itself to single characters, but also having a way to build a dictionary of longer substrings from previously seen text, based on the assumption that it will likely encounter them again. This is already a mountain of text, so I'm going to skip explaining the full algorithm.
What's more important here are the Uiua glyphs. There are currently about 80 unique Uiua glyphs (not including the ASCII ones). We could pre-load all of them in the dictionary, but most snippets use only a handful of those, so we'd waste precious bits encoding "index i out of 80" instead of "index i out of [tiny handful]". So instead we can create a second, dedicated Uiua glyph dictionary that we access with a second escape code: `new uiua glyph`. That would have an overhead of about 7 bits per newly encountered glyph in the input, plus the escape code, which is much less than the 24 or 32 bits for the full UTF-8 literal. And after that it would only need as many bits as log2(main dictionary size).
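Sketched at the same token level, the two-escape-code variant could look something like this (`UIUA_GLYPHS` is just a stand-in for the real table):

```js
// Hypothetical append-only table of Uiua glyphs, in a fixed published order
const UIUA_GLYPHS = ["⊙", "⍥", "⇡" /* , ... */];

function compressWithGlyphTable(input) {
  const dict = new Map(); // char -> index in the main (per-snippet) dictionary
  const out = [];
  for (const ch of input) {
    if (dict.has(ch)) {
      out.push({ index: dict.get(ch) });               // already seen: main dictionary index
    } else if (UIUA_GLYPHS.includes(ch)) {
      out.push({ newGlyph: UIUA_GLYPHS.indexOf(ch) }); // escape #2: ~7-bit glyph table index
      dict.set(ch, dict.size);
    } else {
      out.push({ literal: ch });                       // escape #1: full character literal
      dict.set(ch, dict.size);
    }
  }
  return out;
}
```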
As long as we make this glyph dictionary an append-only list (meaning any new glyphs added to the language are added at the end, and if they are removed from the language they stay in the list), we can update this table for new Uiua versions while keeping the compression scheme backwards-compatible with previous versions: just start the compressed bitstream with a number encoding the Uiua glyph dictionary size (which is effectively the compression scheme's semver). Even if we cycled through so many glyph shapes that the list grew to 255 glyphs, that still requires just 8 bits to load a Uiua glyph not previously encountered (because of that sweet, sweet log(n) scaling behavior). Go up to a thousand glyphs and you still only require 10 bits.
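The versioning part then only needs a tiny header; continuing the same sketch:

```js
// Encoder: record how large the append-only glyph table was at compression time.
function makeHeader() {
  return { glyphTableSize: UIUA_GLYPHS.length };
}

// Decoder: only use the prefix of the table that existed back then, so links made
// with an older glyph set keep decoding correctly even after new glyphs are appended.
function glyphTableFor(header) {
  return UIUA_GLYPHS.slice(0, header.glyphTableSize);
}
```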
So yeah, we're probably fine there.
Phew, that was a lot. Hope it was clear enough to follow.
I just saw this in the discord:
https://uiua.org/pad?src=Z2NkIOKGkCA74o2lKOKOiz0wLuKXvyzihpLihqfihqUsLCniiJ4KYXJyR2NkIOKGkCBnY2TiiqLiiLbiiqLih4wuCmNvbXBsZXhTcXVhcmUg4oaQIFvihpIow5fDlzIpLcOXLuKItsOXLiws4oqi4oi24oqi4oeMLl0KcGFpcnMg4oaQIC_iioLiip7iioIuKzHih6EKc29ydCDihpAg4oqP4oyCLgoKdHJpcGxldHMg4oaQICgKICBwYWlycwogIOKWveKJoSg84oqi4oi24oqi4oeMLikuICMgZmlsdGVyIGZvciBvbmx5IHNvcnRlZCBwYWlycwogIOKJoShjb21wbGV4U3F1YXJlKSAjIGdldCB0cmlwbGV0cwogIOKWveKJoSg9MWFyckdjZCkuICMgZmlsdGVyIG91dCByZXBlYXRzCiAg4omhKOKKguKItuKImi8r4oG_Mi4pICMgZ2VuZXJhdGUgdGhpcmQgbnVtYmVyCiAgLiAjIHNhdmUgZm9yIGxhdGVyCiAg4omhKC8rKSAjIHN1bQogIOKGkuKItiAjIGJyaW5nIGNvbnN0YW50IGZvcndhcmQKICDiiLoow7cpICMgZmluZCByYXRpb3MKICA9MOKXvzEuICMgYnVpbGQga2VlcCBhcnJheQogIOKGkuKWvS4gIyBrZWVwIHJhdGlvCiAg4oaS4pa94oi2ICMga2VlcCB0cmlwbGV0CiAgw5cgIyBtdWx0aXBseSByYXRpbyBhbmQgdHJpcGxldAogIOKJoSgvw5cpICMgcmVkdWNlIGludG8gcHJvZHVjdAopCgp0cmlwbGV0cyAzMCAxMjA=
That's 844 characters for the base 64 portion of the URL, and it's not that big of a snippet. I just wanted to mention that if you throw the same source code through lz-string's url-safe compressor, it gives a string of 636 characters.
I assume that array languages generally do not compress very well in the "repeated substring" sense, but the thing is that lz-string also compresses by building its dictionary only from characters actually present in the input. Since there are only 68 unique characters in this linked example (and it's only that many because of the comments), this already saves about two bits per character compared to a regular UTF-8 conversion.
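For reference, that comparison only needed the existing lz-string package (assuming it's installed; the pad would of course also need the decompression side wired in):

```js
import LZString from "lz-string";

const source = "/* the pad snippet from the link above */";

// Compress a Uiua source string into a URL-safe form and back
const compressed = LZString.compressToEncodedURIComponent(source);
const restored = LZString.decompressFromEncodedURIComponent(compressed);
console.log(compressed.length);   // 636 characters for the snippet above
console.log(restored === source); // true
```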
I'm not sure how one might add this in a backwards-compatible way (also, a variant algorithm more dedicated to Uiua's particular "text distribution" might be more appropriate; let me know if you want me to look into that).