Open Yoric opened 6 years ago
since the vast majority of numbers are 0 and 1
Can you elaborate on this claim?
I ran statistics on a corpus of js files (mostly the source code of the Facebook chat) and if I recall correctly, about 50% of the number literals were 0, with 1 being half of the rest. For the moment, these statistics are buried in a more complicated tool, but I'll try and extract a simple tool to reproduce the experiment.
I did a quick test in AST Explorer on the jQuery codebase, and got this result:
number occurrences
0 277
1 206
2 48
3 38
4 16
9 14
11 11
5 9
8 5
0.5 4
200 4
6 3
36 3
65536 2
10 2
16 2
304 2
204 2
55296 1
1023 1
56320 1
50 1
7 1
0.1 1
20 1
12 1
13 1
600 1
400 1
300 1
1223 1
404 1
For the moment, literal numbers and booleans are represented by
This typically means that a boolean will be stored as 2 bytes and a number other than infinity as 9 bytes. The latter is particularly odd, since the vast majority of numbers are 0 and 1. So, in AST v2, I had success decreasing the size of files by introducing special literal values for
0
,true
andfalse
.One way of doing this would be to introduce in the grammar
Similarly, it seems odd to reserve one byte for
UpdateExpression
to determine whether it's a prefix or a postfix expression, so we could rewriteAdmittedly, introducing a change in the AST solely for the sake of compression sounds a bit odd. An alternative would be to either trust compression (but this didn't seem to work that well in AST v2) or somehow make the (de)tokenizers smart enough to introduce the above changes transparently (note to self: the deanonymizer would certainly be smart enough to do this).
Starting the conversation here.