This implements a new hash table arrangement in seedForth-tokenizer.fs. If I understand the previous code correctly, I think it makes a bit of a simplification in that it allocates a large hash table to make collisions unlikely, but can't then handle collisions.
So with the new method I have allocated a 1024-entry symbol table and I'm using closed hashing, in which a CRC function gives the preferred entry for a given symbol, but if that entry is occupied by another symbol it looks circularly for a free entry.
The main reason why I wanted to make this change is so that the tokenizer can be self-hosting on Z80, which does not have the memory for a large hash table. With the new way, there's no reliance on 32-bit arithmetic either, which is useful for Z80.
The choice of a 1024-entry table is due to a change I made in the last PR which freed up token 3 for additional token space:
\ token are in the range 0 .. 1023:
\ 0, 4 .. 255 are single byte tokens
\ 256 .. 511 are double byte tokens of the form 01 xx
\ 511 .. 767 are double byte tokens of the form 02 xx
\ 768 .. 1023 are double byte tokens of the form 03 xx
It would of course be easy to make the table larger on a target like i386 that can support it. If so, then the current CRC10 could be changed to say CRC12 or CRC16 (I really hate the idea of calculating a CRC and then masking it off to lose its properties!).
Note: There are minor bugs in this PR, as follows:
I had directly invoked gforth in Makefile instead of $(HOSTFORTH). It is fixed in #12 so I have not fixed it here.
A hash table entry is considered vacant between Macro and end-macro, and could be reused by accident, though it would be unlikely. It is fixed in #16 so I have not fixed it here.
If you do want the fixed version of this PR see the branch tokenizer_hash_table1 in my github account. I wouldn't recommend using that branch though, because it will cause conflicts later when mergining #12, #16 and others.
This implements a new hash table arrangement in
seedForth-tokenizer.fs
. If I understand the previous code correctly, I think it makes a bit of a simplification in that it allocates a large hash table to make collisions unlikely, but can't then handle collisions.So with the new method I have allocated a 1024-entry symbol table and I'm using closed hashing, in which a CRC function gives the preferred entry for a given symbol, but if that entry is occupied by another symbol it looks circularly for a free entry.
The main reason why I wanted to make this change is so that the tokenizer can be self-hosting on Z80, which does not have the memory for a large hash table. With the new way, there's no reliance on 32-bit arithmetic either, which is useful for Z80.
The choice of a 1024-entry table is due to a change I made in the last PR which freed up token 3 for additional token space:
It would of course be easy to make the table larger on a target like i386 that can support it. If so, then the current CRC10 could be changed to say CRC12 or CRC16 (I really hate the idea of calculating a CRC and then masking it off to lose its properties!).
Note: There are minor bugs in this PR, as follows:
gforth
inMakefile
instead of$(HOSTFORTH)
. It is fixed in #12 so I have not fixed it here.Macro
andend-macro
, and could be reused by accident, though it would be unlikely. It is fixed in #16 so I have not fixed it here.If you do want the fixed version of this PR see the branch
tokenizer_hash_table1
in my github account. I wouldn't recommend using that branch though, because it will cause conflicts later when mergining #12, #16 and others.