In the shell backend, the variables passed to `defstr` are allocated sequentially, even when the same string is used multiple times.
The `laurent/deduplicate_defstr_strings` branch implements sharing of string variables for identical strings. This requires interning strings the same way we intern identifiers (using the same table), which slows down tokenizing of both identifiers and strings: the table holds more conflicting entries, so lookups degrade into longer linear-probing chains. The result is a slower bootstrap for a minor gain in code quality (and even that is debatable, since it can make it harder to associate strings with their string variables, and it moves pnut away from being single-pass).
There seem to be a few options:
- Use a larger hash table. This seems to help (even with a modest increase to `HASH_PARAM = 1026`, `HASH_PRIME = 1009`), but it's unclear what size we should use. For larger programs, we'd probably want a larger table.
- Use a different hashing algorithm.
This is low priority, so I'm creating this ticket to dump the progress on the problem.