Please explain dictionary entry header

aw commented 3 years ago

I'm trying to understand derzforth's word header but I can't seem to grok how you're defining it.

The sectorforth's diagram of a word's dictionary entry looks like this:

; Each dictionary entry is laid out in memory as such:
;
; *--------------*--------------*--------------*--------------*
; | Link pointer | Flags+Length |    Name...   |    Code...   |
; *--------------*--------------*--------------*--------------*
;     2 bytes         1 byte      Length bytes     Variable
;

I made a diagram for derzforth:

+--------------+-------+-----------+------------+
| Link pointer | Flags | Word Name | Code Field |
---------------+-------+-----------+------------+
  32-bits        2 bits  30 bits     32-bits

My questions to better understand derzforth:

Can you confirm if this is accurate?
Why is the length not saved in the header as in other Forths?
Why are the flags part of the Name instead of a separate byte/word?
Why is the Code field always enter?

I think with this knowledge I'll be better positioned to contribute here.

Thanks!

aw commented 3 years ago

To follow-up on the above comment, which relates to what I hinted at in #10:

I think the Length of word plays an important role, in that it's possible to quickly skip a word lookup by simply comparing its length. I'm aware the hashes all have the same length (32-bits), but that now makes it mandatory to first hash the word to find a match, as opposed to using a much quicker/simpler heuristic.
Word collisions have a low probably with 30 bits of entropy, but I feel 32-bits would still be better.
Could we save 4 bytes for each new dictionary word by automatically assuming its Code field is enter ?

aw commented 3 years ago

OK I think I can answer my own questions here:

Length is not saved because it's not necessary. If it was used for lookups and a match is found, we would still need to compare either a hash of the words or the words themselves... Hashing is much more efficient.
30-bits entropy for the word name is more than enough considering the limited size of the dictionary (total 12KB / with Forth itself?).

Still not sure about the Code field... is it just a pointer to the body of the word definition? i.e: always changing?

theandrew168 commented 3 years ago

Sorry for the delay here! I'm considering doing a "fancy" write-up of these questions / answers and making a nice chunk of reference documentation from it (these are solid questions).

But yea, we don't need Length since DerzForth uses a fixed-size (4 byte) hash value instead. Some old school Forths use a strategy of encoding the length of the word name (1 byte) plus the first three letters of the name (3 bytes) to identify words. IMO, that's really just a terrible string hash. DJB2 is a much better solution that dwarfs its extremely low chance of collision.

Each word requires two bits of metadata: immediate and hidden. I could move these bits into a whole new byte but that'd waste 6 bits. I think that cutting down the hash value range to 2^30 is better tradeoff for this particular embedded-focused Forth implementation. As you mentioned, this likely still exceeds how many words you'd be able to define given the tight memory limit of each target device.

The Code field holds the address of the machine code for a given word. For builtin / primitive words, it just points to 4 bytes ahead (since the ASM code for a primitive word immediately follows its header definition). For secondary words, the Code field points to the inner interpreter routine named ENTER (sometimes known as DOCOLON). To quote Brad Rodriguez's amazing Moving Forth series:

But, SQUARE is a high-level "colon" definition -- it holds a "thread", a list of addresses. In order
to perform this definition, the Forth interpreter must be re-started at a new location: the Parameter
Field of SQUARE. Of course, the interpreter's old location must be saved, to resume the "other"
Forth word once SQUARE is finished. This is just like a subroutine call! The machine language action
of SQUARE is simply to push the old IP, set IP to a new location, run the interpreter, and when
SQUARE is done pop the IP. (As you can see, the IP is the "program counter" of high-level Forth.)
This is called DOCOLON or ENTER in various Forths:

    PUSH IP     onto the "return address stack"
    W+2 -> IP   W still points to the Code Field, so W+2 is 
                the address of the Body!  (Assuming a 2-byte
                address -- other Forths may be different.)
    JUMP to interpreter ("NEXT")

DerzForth's implementation of ENTER is functionally equivalent:

sw IP, 0(RSP)     # IP -> [RSP]
addi RSP, RSP, 4  # RSP += 4
addi IP, W, 4     # IP = W + 4 (skip code field)
j next

aw commented 2 years ago

Thank you for the explanation. I think DOCOL was the missing link here. I read about it but didn't realize your ENTER has the same role. This makes much more sense now!

theandrew168 / derzforth

Please explain dictionary entry header #11