uho / preForth

a minimalistic Forth kernel that can bootstrap
GNU General Public License v3.0
72 stars 9 forks source link

Tokenizer hash table #10

Open nickd4 opened 2 years ago

nickd4 commented 2 years ago

This implements a new hash table arrangement in seedForth-tokenizer.fs. If I understand the previous code correctly, I think it makes a bit of a simplification in that it allocates a large hash table to make collisions unlikely, but can't then handle collisions.

So with the new method I have allocated a 1024-entry symbol table and I'm using closed hashing, in which a CRC function gives the preferred entry for a given symbol, but if that entry is occupied by another symbol it looks circularly for a free entry.

The main reason why I wanted to make this change is so that the tokenizer can be self-hosting on Z80, which does not have the memory for a large hash table. With the new way, there's no reliance on 32-bit arithmetic either, which is useful for Z80.

The choice of a 1024-entry table is due to a change I made in the last PR which freed up token 3 for additional token space:

\ token are in the range 0 .. 1023:
\   0, 4 ..  255 are single byte tokens
\    256 ..  511 are double byte tokens of the form 01 xx
\    511 ..  767 are double byte tokens of the form 02 xx
\    768 .. 1023 are double byte tokens of the form 03 xx

It would of course be easy to make the table larger on a target like i386 that can support it. If so, then the current CRC10 could be changed to say CRC12 or CRC16 (I really hate the idea of calculating a CRC and then masking it off to lose its properties!).

Note: There are minor bugs in this PR, as follows:

If you do want the fixed version of this PR see the branch tokenizer_hash_table1 in my github account. I wouldn't recommend using that branch though, because it will cause conflicts later when mergining #12, #16 and others.