tc39 / proposal-binary-ast

Binary AST proposal for ECMAScript
964 stars 23 forks source link

Optimizing booleans, numbers, etc. #26

Open Yoric opened 6 years ago

Yoric commented 6 years ago

For the moment, literal numbers and booleans are represented by

interface LiteralInfinityExpression : Node { };

interface LiteralNumericExpression {
  attribute double value;
};

interface LiteralBooleanExpression : Node {
  attribute boolean value;
};

This typically means that a boolean will be stored as 2 bytes and a number other than infinity as 9 bytes. The latter is particularly odd, since the vast majority of numbers are 0 and 1. So, in AST v2, I had success decreasing the size of files by introducing special literal values for 0, true and false.

One way of doing this would be to introduce in the grammar

interface LiteralInfinityExpression : Node { };
typedef (LiteralZeroExpression or
  LiteralOneExpression or
  LiteralDoubleExpression)
 LiteralNumericExpression;

interface LiteralDoubleExpression {
  attribute double value;
};
interface LiteralZeroExpression {};
interface LiteralOneExpression {};

typedef (LiteralTrueBooleanExpression or LiteralFalseBooleanExpression) LiteralBooleanExpression

interface LiteralTrueExpression : Node { }
interface LiteralFalseExpression : Node { }

Similarly, it seems odd to reserve one byte for UpdateExpression to determine whether it's a prefix or a postfix expression, so we could rewrite

typedef (PrefixUpdateExpression or PostfixUpdateExpression) UpdateExpression;

interface PrefixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

interface PostfixUpdateExpression : Node {
  attribute UpdateOperator operator;
  attribute SimpleAssignmentTarget operand;
};

Admittedly, introducing a change in the AST solely for the sake of compression sounds a bit odd. An alternative would be to either trust compression (but this didn't seem to work that well in AST v2) or somehow make the (de)tokenizers smart enough to introduce the above changes transparently (note to self: the deanonymizer would certainly be smart enough to do this).

Starting the conversation here.

ljharb commented 6 years ago

since the vast majority of numbers are 0 and 1

Can you elaborate on this claim?

Yoric commented 6 years ago

I ran statistics on a corpus of js files (mostly the source code of the Facebook chat) and if I recall correctly, about 50% of the number literals were 0, with 1 being half of the rest. For the moment, these statistics are buried in a more complicated tool, but I'll try and extract a simple tool to reproduce the experiment.

j-f1 commented 6 years ago

I did a quick test in AST Explorer on the jQuery codebase, and got this result:

number  occurrences
0       277
1       206
2       48
3       38
4       16
9       14
11      11
5       9
8       5
0.5     4
200     4
6       3
36      3
65536   2
10      2
16      2
304     2
204     2
55296   1
1023    1
56320   1
50      1
7       1
0.1     1
20      1
12      1
13      1
600     1
400     1
300     1
1223    1
404     1
Transform code to put in after selecting `babel7` from the transform panel ```js export default function(babel) { const { types: t } = babel const nums = new Map() return { visitor: { NumericLiteral(path) { const { node: { value } } = path if (!nums.has(value)) { nums.set(value, 1) } else { nums.set(value, nums.get(value) + 1) } }, Program: { exit(path) { path .get("body") .slice(1) .map(p => p.remove()) path.get("body")[0].replaceWith( t.identifier( [...nums.keys()] .sort((a, b) => nums.get(b) - nums.get(a)) .map(k => k + " ".repeat(8 - ("" + k).length) + nums.get(k)) .join("\n") ) ) } } } } } ```