Optimize scanner by checking characters in order of frecuency

ElianCordoba commented 2 years ago

🔍 Search Terms

scanner perf
character frequency
micro-optimization

✅ Viability Checklist

My suggestion meets these guidelines:

[x] This wouldn't be a breaking change in existing TypeScript/JavaScript code
[x] This wouldn't change the runtime behavior of existing JavaScript code
[x] This could be implemented without emitting different JS based on the types of the expressions
[x] This isn't a runtime feature (e.g. library functionality, non-ECMAScript syntax with JavaScript output, new syntax sugar for JS, etc.)
[x] This feature would agree with the rest of TypeScript's Design Goals.

⭐ Suggestion

While reading the source code of the scanner (mainly the scan function) I noticed that in the switch case to decide what token the current character is going to be could be optimised, since there are many checks in the first positions that would cover very obscure and rarely used characters, for example:

switch(ch) {
    ...
    case CharacterCodes.tab:                              // This is common
    case CharacterCodes.verticalTab:
    case CharacterCodes.formFeed:
    case CharacterCodes.space:                            // This is by far the most common spacing character
    case CharacterCodes.nonBreakingSpace:
    case CharacterCodes.ogham:
    case CharacterCodes.enQuad:
    case CharacterCodes.emQuad:
    case CharacterCodes.enSpace:
    case CharacterCodes.emSpace:
    case CharacterCodes.threePerEmSpace:
    case CharacterCodes.fourPerEmSpace:
    case CharacterCodes.sixPerEmSpace:
    case CharacterCodes.figureSpace:
    case CharacterCodes.punctuationSpace:
    case CharacterCodes.thinSpace:
    case CharacterCodes.hairSpace:
    case CharacterCodes.zeroWidthSpace:
    case CharacterCodes.narrowNoBreakSpace:
    case CharacterCodes.mathematicalSpace:
    case CharacterCodes.ideographicSpace:
    case CharacterCodes.byteOrderMark:

The rest of the characters are pretty much never seen

Data to validate the idea

I wrote this simple script that:

You prepare folders to scan
Goes through each folder finding every javascript and typescript file, including declarations.
Reads and counts the occurrences of every character
Generates a report

These are the results after running that script with the following input:

Angular
Vue
definitely-typed
node
react
Typescript

All charaters frecuency

Letters and digits are compacted

Character	CharCode	Occurrences
a - Z	---	283946350
	32	118426366
\n	10	12939620
0 - 9	---	12032791
/	47	7656952
.	46	5719973
,	44	5261314
*	42	4817435
(	40	4596610
)	41	4576496
:	58	4410275
;	59	4259587
\"	34	2987244
=	61	2972430
'	39	2243891
{	123	2129247
}	125	2127271
\r	13	1913084
-	45	1848648
>	62	1526829
_	95	1382751
\|	124	1185122
[	91	1165403
]	93	1143123
<	60	674655
`	96	666969
?	63	615641
+	43	593973
@	64	583870
&	38	317592
^	94	301090
\	92	273808
!	33	204297
$	36	176215
å	229	166608
#	35	146205
æ	230	109732
ç	231	87046
\t	9	79940
%	37	74606
è	232	73751
ä	228	53278
Ð	208	42002
à	224	38155
é	233	37039
â	226	30489
É	201	28294
ï	239	28069
ã	227	25070
×	215	20241
\u001b	27	20113
Ñ	209	17768
Î	206	17690
~	126	13372
Ï	207	8630
ì	236	8361
Ã	195	8231
Â	194	7939
Ø	216	6893
ë	235	5871
Ù	217	4212
ð	240	4032
í	237	2765
Å	197	2620
á	225	2444
ê	234	2276
\u0000	0	1674
Ä	196	1223
Û	219	1168
Ì	204	752
Õ	213	476
Ú	218	436
Ë	203	294
Ò	210	250
Ö	214	205
Í	205	190
Æ	198	166
î	238	136
Ó	211	113
Ç	199	98
Ê	202	88
È	200	77
Ô	212	74
\u0006	6	70
\u0001	1	41
\u007f	127	20
Þ	222	19
\u0004	4	10
\u0015	21	10
\u001f	31	10
ô	244	4
\u000e	14	3
\b	8	2
\u0019	25	2
\u0007	7	1
\u000b	11	1
\u0018	24	1
ò	242	1

Frequency of letters

Character	CharCode	Occurrences
e	101	34744603
t	116	25436137
n	110	19417462
r	114	19124734
o	111	18643411
a	97	18203564
i	105	17726833
s	115	16256937
l	108	11030026
c	99	10415858
d	100	9545849
u	117	8105944
p	112	8053582
m	109	6732462
h	104	6229093
f	102	5835848
g	103	4948635
y	121	3963501
b	98	3913809
v	118	3094000
x	120	2473019
T	84	2138187
w	119	2117379
S	83	2085778
E	69	1878189
C	67	1846989
A	65	1675336
k	107	1616486
I	73	1471091
P	80	1331507
R	82	1233493
D	68	1232785
O	79	1195486
N	78	1111634
F	70	1024793
M	77	954343
j	106	940131
L	76	913130
B	66	825556
H	72	531091
U	85	522822
q	113	508663
V	86	479602
W	87	473618
G	71	449814
z	122	369489
K	75	260218
J	74	220435
Y	89	183409
Q	81	172314
X	88	171264
Z	90	116011

Lowercase occurrences: 259447455
Uppercase occurrences: 24498895

Findings

Letters are by far the most common characters when counting them all together, and since in order to find identifiers you put them into the same bucket, that should be the first thing to check.
Lowercase letters are more than 10 times more likely to appear than uppercase ones.
As expected the whitespace (32) is by far the most common spacing character.

Optimizations to make

Sort the checks in the order corresponding with the occurrences found. I'm pulling up a PR that I've been working on. Spoiler alert, in my machine I didn't find much of a speedup, but it will be interesting to see the result of your perf suite.

MartinJohns commented 2 years ago

Did you actually profile the performance changes? As far as I know the switch is internally compiled to a map, so I would expect absolutely no performance improvements from re-ordering the cases.

ElianCordoba commented 2 years ago

As I mentioned at the end of the post I did bench it but didn't find any major difference. That's why I'm pulling up the PR to test it on a bigger scale to see if there is any improvement outside the noise.

In regards to the switch optimization not yielding more performance I wrote this very simple benchmark that shows some minor improvements.

RyanCavanaugh commented 2 years ago

Super curious to see the perf results. I'm not expecting much because the perf suite doesn't have the sort of enormous files that would be needed to show a significant delta.

RyanCavanaugh commented 2 years ago

The testcase you linked isn't representative at all, since it's using case char === 'a': instead of case 'a':

ElianCordoba commented 2 years ago

Oh wow, that's embarrassing, I coded a little bit too quickly. Here is a more correct benchmark to which I also added two switch true variants

const { Suite } = require('benchmark')
const suite = new Suite;

const data = 'assssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasas'

suite.add('Unoptimized', function () {
  for (const char of data) {
    switch (char) {
      case '0': return;
      case '1': return;
      case '2': return;
      case '3': return;
      case 'a': return;
      case 's': return;
      case 'd': return;
    }
  }
})
  .add('Optimized', function () {
    for (const char of data) {
      switch (char) {
        case 'a': return;
        case 's': return;
        case 'd': return;
        case '0': return;
        case '1': return;
        case '2': return;
        case '3': return;
      }
    }
  })
  .add('Switch true', function () {
    for (const char of data) {
      switch (true) {
        case char == 'a': return;
        case char == 's': return;
        case char == 'd': return;
        case char == '0': return;
        case char == '1': return;
        case char == '2': return;
        case char == '3': return;
      }
    }
  })
  .add('Switch true strict', function () {
    for (const char of data) {
      switch (true) {
        case char === 'a': return;
        case char === 's': return;
        case char === 'd': return;
        case char === '0': return;
        case char === '1': return;
        case char === '2': return;
        case char === '3': return;
      }
    }
  })
  .on('cycle', function (event) {
    console.log(String(event.target));
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run();

Which yielded the following results

Unoptimized x 18,670,778 ops/sec ±0.45% (99 runs sampled)
Optimized x 105,950,167 ops/sec ±0.12% (101 runs sampled)
Switch true x 110,007,305 ops/sec ±0.07% (98 runs sampled)
Switch true strict x 110,274,698 ops/sec ±0.08% (100 runs sampled)
Fastest is Switch true strict

RyanCavanaugh commented 2 years ago

You're comparing strings, but the scanner uses (hardcoded) numbers in that switch.

MartinJohns commented 2 years ago

In the benchmark code you also return from within the switch, meaning the code will only ever iterate the first character.

But man, benchmarking JavaScript code is hard. I must be missing some hidden optimizations. With the "benchmark" JS library and running node I get vastly different results depending on the order of the tests. There does seem to be a considerable difference in NodeJS, but hardly any in Chrome.

This is the NodeJs code I tested:

const { Suite } = require('benchmark');
const suite = new Suite;

const str = 'assssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasasassssaadaddaassdasdaassaaassaadaaaaassssaasaassasddaaaaasssaadaddaasasas';
const onlyAData = 'a'.repeat(str.length).split('').map(x => x.charCodeAt(0));
const mixedData = str.split('').map(x => x.charCodeAt(0));

suite
  .add('Mixed: Check a first', function () {
    var a = 0, other = 0;
    for (const char of mixedData) {
      switch (char) {
        case 97: ++a; break;
        case 98: ++other; break;
        case 99: ++other; break;
        case 100: ++other; break;
        case 101: ++other; break;
        case 102: ++other; break;
        case 103: ++other; break;
        case 104: ++other; break;
        case 105: ++other; break;
        case 106: ++other; break;
        case 107: ++other; break;
        case 108: ++other; break;
        case 109: ++other; break;
        case 110: ++other; break;
        case 111: ++other; break;
        case 112: ++other; break;
        case 113: ++other; break;
        case 114: ++other; break;
        case 115: ++other; break;
        case 116: ++other; break;
        case 117: ++other; break;
        case 118: ++other; break;
        case 119: ++other; break;
        case 120: ++other; break;
        case 121: ++other; break;
        case 122: ++other; break;
      }
    }
  })
  .add('Mixed: Check a last', function () {
    var a = 0, other = 0;
    for (const char of mixedData) {
      switch (char) {
        case 98: ++other; break;
        case 99: ++other; break;
        case 100: ++other; break;
        case 101: ++other; break;
        case 102: ++other; break;
        case 103: ++other; break;
        case 104: ++other; break;
        case 105: ++other; break;
        case 106: ++other; break;
        case 107: ++other; break;
        case 108: ++other; break;
        case 109: ++other; break;
        case 110: ++other; break;
        case 111: ++other; break;
        case 112: ++other; break;
        case 113: ++other; break;
        case 114: ++other; break;
        case 115: ++other; break;
        case 116: ++other; break;
        case 117: ++other; break;
        case 118: ++other; break;
        case 119: ++other; break;
        case 120: ++other; break;
        case 121: ++other; break;
        case 122: ++other; break;
        case 97: ++a; break;
      }
    }
  })
  .add('Only A: Check a first', function () {
    var a = 0, other = 0;
    for (const char of onlyAData) {
      switch (char) {
        case 97: ++a; break;
        case 98: ++other; break;
        case 99: ++other; break;
        case 100: ++other; break;
        case 101: ++other; break;
        case 102: ++other; break;
        case 103: ++other; break;
        case 104: ++other; break;
        case 105: ++other; break;
        case 106: ++other; break;
        case 107: ++other; break;
        case 108: ++other; break;
        case 109: ++other; break;
        case 110: ++other; break;
        case 111: ++other; break;
        case 112: ++other; break;
        case 113: ++other; break;
        case 114: ++other; break;
        case 115: ++other; break;
        case 116: ++other; break;
        case 117: ++other; break;
        case 118: ++other; break;
        case 119: ++other; break;
        case 120: ++other; break;
        case 121: ++other; break;
        case 122: ++other; break;
      }
    }
  })
  .add('Only A: Check a last', function () {
    var a = 0, other = 0;
    for (const char of onlyAData) {
      switch (char) {
        case 98: ++other; break;
        case 99: ++other; break;
        case 100: ++other; break;
        case 101: ++other; break;
        case 102: ++other; break;
        case 103: ++other; break;
        case 104: ++other; break;
        case 105: ++other; break;
        case 106: ++other; break;
        case 107: ++other; break;
        case 108: ++other; break;
        case 109: ++other; break;
        case 110: ++other; break;
        case 111: ++other; break;
        case 112: ++other; break;
        case 113: ++other; break;
        case 114: ++other; break;
        case 115: ++other; break;
        case 116: ++other; break;
        case 117: ++other; break;
        case 118: ++other; break;
        case 119: ++other; break;
        case 120: ++other; break;
        case 121: ++other; break;
        case 122: ++other; break;
        case 97: ++a; break;
      }
    }
  })
  .on('cycle', function (event) {
    console.log(String(event.target));
  })
  .on('complete', function () {
    console.log('Fastest is ' + this.filter('fastest').map('name'));
  })
  .run();

And the result:

Mixed: Check a first x 293,276 ops/sec ±0.76% (89 runs sampled)
Mixed: Check a last x 154,734 ops/sec ±1.15% (88 runs sampled)
Only A: Check a first x 565,584 ops/sec ±2.03% (91 runs sampled)
Only A: Check a last x 124,091 ops/sec ±0.55% (90 runs sampled)
Fastest is Only A: Check a first

A browser benchmark seems less drastic of a difference: https://jsben.ch/D9Rk1

But I don't know how reliable this library and the website is.

ElianCordoba commented 2 years ago

@RyanCavanaugh definitely yesterday was a long day 🙃. Today I re-runned that bench property, comparing numbers instead of strings, and I saw a ~1% performance improvement.

Leaving that simple benchmark aside today I created this repo to test both scanner functions, the current version and the one with my changes, they are quite simplified versions but I think it's a fair comparison and a more real to life example. My results show ~5% improvements

@MartinJohns Indeed javascript performance is hard to measure, check that repo I just shared where I was puzzled after getting always the first version of the scanner to run being faster than the second one, it didn't matter wherever was my version or the vanilla one. Then I realized that I had to manually assign undefined to the scan function to kind of free memory or trigger the GC or whatever happens in the back of V8 so that the second run results wouldn't be affected ¯_(ツ)_/¯

microsoft / TypeScript