robert-strandh / SICL

A fresh implementation of Common Lisp
Other
1.07k stars 79 forks source link

Some improvements to hash tables #156

Closed no-defun-allowed closed 4 years ago

no-defun-allowed commented 4 years ago
kchanqvq commented 3 years ago

Bruh so all cons has same hash code.

robert-strandh commented 3 years ago

No, in the native system, they would be hashed by address.

kchanqvq commented 3 years ago

Is there code for the hash by address implementation yet? I wonder how it could deal with moving GC. cons doesn't seem to have a hash slot?

robert-strandh commented 3 years ago

Right, there is no hash slot in a CONS. The idea is that when a CONS is used as a hash key, it is migrated to the global heap where things don't move. Then the address can be used as a hash key.

But we haven't yet implemented bootstrapping code to assign an address in the simulated heap for this case.

kchanqvq commented 3 years ago

Hmm, how could this migration happen then? It require fixing existing references to the CONS. Does it triggers an immediate GC to fix references? (which will be hella slow when adding lots of CONSes as keys consecutively) Or do we put an forward pointer and check for it for every CONS access (which slows down every CONS access)?

no-defun-allowed commented 3 years ago

Yes, migration speed is being considered. From what I remember from IRC on GC design:

Another option which SBCL uses would be to invalidate hash tables, so that they have to be rehashed after the (mostly-copying) garbage collector runs. This could be made to work, as only thread-local nurseries move objects, and the nursery GC stops the mutator thread to do so. But this requires the garbage collector to mark hash tables for rehashing, and hash table functions now might need to check if the GC invalidated the table in the middle of a function (depending on the implementation technique).

Admittedly, I'm not sure if I like the idea of any hash keys escaping to the global heap, since that would also imply that anything else reachable from the hash table cannot be thread-local either. On the other hand, I can't think of any times where I used a cons cell as a key for an eq hash table; but I'd be interested to hear if anyone has found a use for such a situation.

no-defun-allowed commented 3 years ago

A quick "survey" in #commonlisp suggests that some people have used eq hash tables with cons cells, but most don't remember why. Shinmera mentioned one good use wherein a data structure involving cons cells is traversed, and a hash table is used to avoid traversing it twice; and then I remembered that I also have done this before.

kchanqvq commented 3 years ago

I use it for hash-consing, which I think should be a very common usage -- unless people nowadays don't do symbolic manipulation any more.

Although technically hash-consing only makes sense for immutable CONS, alas there isn't really immutable CONS in standard CL.

I also use it to attach general information to CONSes, which is handy when propogating informations during symbolic transformations. IDK if it's the right thing to do, but it sounds reasonable. The alternative will be define a custom TRONS datatype that have a plist and use it in place of CONS, eww.

no-defun-allowed commented 3 years ago

To the best of my knowledge, a hash consing table would use an equal test on its keys, as we want the two elements of (list (cons 1 2) (cons 1 2)) to be hashed to the same unique object. But annotating cons cells is also an interesting use of an eq hash table.

kchanqvq commented 3 years ago

Using equal is darn too slow. The general idea is to define hash consing recursively, i.e. if a CONS to be hash-consed always have its CAR and CDR already hash-consed, then you can save the traverse on every CONS, by using a hash that only involves object identity of the CAR and CDR without descending any further.

There're various tricks to do so, but all involves using object identity hash of CONS. An apparent non portable way is to write a custom hash function using implementation-provided EQ-hash. Or to do it portably one can keep an EQ-hash from CONSes to "indirector" structs, and another EQUAL-hash from CONSes of "indirector" structs to real CONSes (stripped away the indirection).

Edit: I realized the second "portable" way might actually not work well (I never really used it as I can find and abuse internal eq-hash definition in most CL implementation), because lots of CL implementation just gives the same hash for all struct. But you get the idea, use some indirection hack that stop the descending in EQUAL-hash. Maybe use GENSYMs.

moon-chilled commented 3 years ago

Hash-consed objects will probably live a long time, so the overhead of moving will be amortized. (Nothing specific to hash-consing really; just generational gc hypothesis in general.) Also: there's a space penalty associated with rc; since there's no space in a cons cell, it has to go out-of-line.

data structure involving cons cells is traversed, and a hash table is used to avoid traversing it twice

I wonder if you rig it so that you use the local ptr as the hash w/o moving to the global heap, and when/if the table or one of its keys moves it gets told to rehash.

no-defun-allowed commented 3 years ago

Also: there's a space penalty associated with rc; since there's no space in a cons cell, it has to go out-of-line.

For "full" refcounting systems this is true, but not so much for one bit reference counting, wherein the usual mark bitmap is used for refcounts, and the cleanliness system I linked only uses two bits.

I wonder if you rig it so that you use the local ptr as the hash w/o moving to the global heap, and when/if the table or one of its keys moves it gets told to rehash.

Yes, SBCL does this to my knowledge.

no-defun-allowed commented 2 years ago

@BlueFlo0d Here's a little trick I heard from one of the PyPy developers.

We maintain a "shadow" cons cell in the global heap for each young cons cell that needs to have a stable address for hashing, and associate the young cell with the shadow in a hash table. The shadow then can be used to produce an address for hashing the young cons cell. When it is time to perform a minor collection, we promote each young cell in the hash table if it is still live, replacing the contents of the shadow with the contents of the young cell, and clear the hash table. This hash table can use address hashing for young objects, because no entries persist across GC, and thus addresses are never invalidated.

If the hash table and keys are short-lived, then we avoid promoting anything (but we still generate garbage in the global heap). Else we at least defer fixing up pointers until minor GC, when we fix every pointer in one go.