Closed no-defun-allowed closed 4 years ago
Bruh so all cons
has same hash code.
No, in the native system, they would be hashed by address.
Is there code for the hash by address implementation yet? I wonder how it could deal with moving GC. cons
doesn't seem to have a hash slot?
Right, there is no hash slot in a CONS. The idea is that when a CONS is used as a hash key, it is migrated to the global heap where things don't move. Then the address can be used as a hash key.
But we haven't yet implemented bootstrapping code to assign an address in the simulated heap for this case.
Hmm, how could this migration happen then? It require fixing existing references to the CONS. Does it triggers an immediate GC to fix references? (which will be hella slow when adding lots of CONSes as keys consecutively) Or do we put an forward pointer and check for it for every CONS access (which slows down every CONS access)?
Yes, migration speed is being considered. From what I remember from IRC on GC design:
Another option which SBCL uses would be to invalidate hash tables, so that they have to be rehashed after the (mostly-copying) garbage collector runs. This could be made to work, as only thread-local nurseries move objects, and the nursery GC stops the mutator thread to do so. But this requires the garbage collector to mark hash tables for rehashing, and hash table functions now might need to check if the GC invalidated the table in the middle of a function (depending on the implementation technique).
Admittedly, I'm not sure if I like the idea of any hash keys escaping to the global heap, since that would also imply that anything else reachable from the hash table cannot be thread-local either. On the other hand, I can't think of any times where I used a cons cell as a key for an eq
hash table; but I'd be interested to hear if anyone has found a use for such a situation.
A quick "survey" in #commonlisp suggests that some people have used eq
hash tables with cons cells, but most don't remember why. Shinmera mentioned one good use wherein a data structure involving cons cells is traversed, and a hash table is used to avoid traversing it twice; and then I remembered that I also have done this before.
I use it for hash-consing, which I think should be a very common usage -- unless people nowadays don't do symbolic manipulation any more.
Although technically hash-consing only makes sense for immutable CONS, alas there isn't really immutable CONS in standard CL.
I also use it to attach general information to CONSes, which is handy when propogating informations during symbolic transformations. IDK if it's the right thing to do, but it sounds reasonable. The alternative will be define a custom TRONS datatype that have a plist and use it in place of CONS, eww.
To the best of my knowledge, a hash consing table would use an equal
test on its keys, as we want the two elements of (list (cons 1 2) (cons 1 2))
to be hashed to the same unique object. But annotating cons cells is also an interesting use of an eq
hash table.
Using equal
is darn too slow. The general idea is to define hash consing recursively, i.e. if a CONS to be hash-consed always have its CAR and CDR already hash-consed, then you can save the traverse on every CONS, by using a hash that only involves object identity of the CAR and CDR without descending any further.
There're various tricks to do so, but all involves using object identity hash of CONS. An apparent non portable way is to write a custom hash function using implementation-provided EQ-hash. Or to do it portably one can keep an EQ-hash from CONSes to "indirector" structs, and another EQUAL-hash from CONSes of "indirector" structs to real CONSes (stripped away the indirection).
Edit: I realized the second "portable" way might actually not work well (I never really used it as I can find and abuse internal eq-hash definition in most CL implementation), because lots of CL implementation just gives the same hash for all struct. But you get the idea, use some indirection hack that stop the descending in EQUAL-hash. Maybe use GENSYMs.
Hash-consed objects will probably live a long time, so the overhead of moving will be amortized. (Nothing specific to hash-consing really; just generational gc hypothesis in general.) Also: there's a space penalty associated with rc; since there's no space in a cons cell, it has to go out-of-line.
data structure involving cons cells is traversed, and a hash table is used to avoid traversing it twice
I wonder if you rig it so that you use the local ptr as the hash w/o moving to the global heap, and when/if the table or one of its keys moves it gets told to rehash.
Also: there's a space penalty associated with rc; since there's no space in a cons cell, it has to go out-of-line.
For "full" refcounting systems this is true, but not so much for one bit reference counting, wherein the usual mark bitmap is used for refcounts, and the cleanliness system I linked only uses two bits.
I wonder if you rig it so that you use the local ptr as the hash w/o moving to the global heap, and when/if the table or one of its keys moves it gets told to rehash.
Yes, SBCL does this to my knowledge.
@BlueFlo0d Here's a little trick I heard from one of the PyPy developers.
We maintain a "shadow" cons cell in the global heap for each young cons cell that needs to have a stable address for hashing, and associate the young cell with the shadow in a hash table. The shadow then can be used to produce an address for hashing the young cons cell. When it is time to perform a minor collection, we promote each young cell in the hash table if it is still live, replacing the contents of the shadow with the contents of the young cell, and clear the hash table. This hash table can use address hashing for young objects, because no entries persist across GC, and thus addresses are never invalidated.
If the hash table and keys are short-lived, then we avoid promoting anything (but we still generate garbage in the global heap). Else we at least defer fixing up pointers until minor GC, when we fix every pointer in one go.
bucket-hash-table
s have random "offsets", which are used to ensure hash tables hash their keys differently, making complexity attacks much more difficult.Some portable hash functions have been added (
sicl-hash-table:eq-hash
,equal-hash
,equalp-hash
), as well as an implementation ofsxhash
. (eq-hash
admittedly has very little to work with, but should perform well with numeric and symbol keys.)Custom hash functions can be provided to
bucket-hash-table
s through the:hash-function
keyword argument, which takes an unsigned 64-bit offset and an object to hash and returns an unsigned 64-bit hash.A
print-object
method was defined for hash tables, which prints something like#<SICL-BUCKET-HASH-TABLE:BUCKET-HASH-TABLE :test #<FUNCTION STRING=> size 1/16 {101C3C3253}>
(using custom hash and test functions).Oh, and Cluster was added as a dependency to the README.