janet-lang / janet

A dynamic language and bytecode vm
https://janet-lang.org
MIT License
3.43k stars 221 forks source link

Defer tuple hashing #1250

Closed primo-ppcg closed 1 year ago

primo-ppcg commented 1 year ago

I don't intend to make a PR about this, I'm merely presenting an idea for consideration.

Tuples are hashed upon creation. At first glance it seems like a good idea, as it speeds up repeated comparisons considerably.

However, this pre-computed hash is only used in two instances:

  1. When testing two tuples for equality.
  2. When the hash is requested directly, including:
    • in the core hash function
    • when used as a key in a struct (1 2 3 4)
    • when hashing a parent ds (1 2)

I think that both of the uses are fairly rare, especially if struct hashing were also to be deferred (in its entirety, which would avoid hashing its values unless necessary. The keys obviously must be hashed). It seems to me that the vast majority of tuples will complete their life cycle without their pre-computed hashes having been examined even once.

I've prepared an experimental branch (version 2) (version 3) which defers tuple hashing until needed. Primarily, I've changed janet_tuple_hash from a macro that extracts the hash from the header, to a function that checks if the hash has already been computed. The results seem pretty good:

(use spork/test)

(def a (range 100))

(timeit-loop [:repeat 10_000_000]
  "splice" [;a])
(timeit-loop [:repeat 100_000]
  "eachp " (eachp p a))
(timeit-loop [:repeat 100_000]
  "pairs " (pairs a))

master:

splice 2.834s, 0.2834µs/body
eachp  0.419s, 4.191µs/body
pairs  0.559s, 5.594µs/body

branch:

splice 0.794s, 0.07943µs/body
eachp  0.383s, 3.831µs/body
pairs  0.497s, 4.968µs/body

In particular for the splice form, hashing the tuple accounts for approximately 70% of the time taken. I think this is significant, as tuples are used essentially everywhere. Structs, and possibly also strings could also have their hashing deferred.

btnlq commented 1 year ago

What do you think about reserving a value to avoid the allocation?

@@ src/include/janet.h, 976 @@

- int32_t *hash;
+ int32_t hash;

@@ src/core/tuple.c, 40 @@

- head->hash = NULL;
+ head->hash = -1;

@@ src/core/tuple.c, 57 @@

  int32_t janet_tuple_hash(const Janet *tuple) {
      JanetTupleHead *head = janet_tuple_head(tuple);
-     if (head->hash == NULL) {
-         head->hash = (int32_t*)janet_malloc(sizeof(int32_t));
-         *head->hash = janet_array_calchash(tuple, head->length);
+     if (head->hash == -1) {
+         int32_t hash = janet_array_calchash(tuple, head->length);
+         head->hash = hash == -1 ? -2 : hash;
      }
      return head->hash;
  }
CosmicToast commented 1 year ago

What do you think about reserving a value to avoid the allocation?

A hash isn't really a number per se, but rather just some raw bytes that can be more easily handled by pretending it's a number. Internally, hashes are mixed via unsigned integer operations, including bit shifting etc.

It's not impossible to end up with a legitimate hash value that is -1. Consequently, having any sort of magic value would mean that those values are no longer representable.

As an alternative, there could be a general metadata bitfield. Something like an int32 called meta, with a MHASHED = 1 << 0 in it. It would then be possible to test for additional optimizations later (e.g sorted? anything really) while keeping the fact that it has been / hasn't been hashed on hand. This would also allow the use of this sort of amortizing technique for real array hashes (i.e ones calculated like they are for tuples -, simply unset that bit whenever the array is modified.

primo-ppcg commented 1 year ago

What do you think about reserving a value to avoid the allocation?

I thought of that, but wasn't clever enough to think of this:

        head->hash = hash == -1 ? -2 : hash;

-1 and -2 are unfortunate choices, though, because of this: https://github.com/janet-lang/janet/blob/7475362c85831da31430dc91e0c01aba0a80bcf9/src/core/value.c#L325 Perhaps -1 and 1 would be better. I've amended the branch to do this instead.

As an alternative, there could be a general metadata bitfield.

Also seems like it could be a good option, and not only for tuples!

bakpakin commented 1 year ago

If we are on the topic of micro optimizations, using 0 as the sentinel hash is probably a bit faster on most architectures, and will compile down to a single conditional instruction. I believe this is how the JVM does string hashing.

primo-ppcg commented 1 year ago

As an alternative, there could be a general metadata bitfield.

Actually, tuples already have flags, don't they?

int32_t janet_tuple_hash(const Janet *tuple) {
    JanetTupleHead *head = janet_tuple_head(tuple);
    if (janet_tuple_flag(tuple) & JANET_TUPLE_FLAG_HASHED)
        return head->hash;

    head->hash = janet_array_calchash(tuple, head->length);
    janet_tuple_flag(tuple) |= JANET_TUPLE_FLAG_HASHED;
    return head->hash;
}

Seems like a good option: https://github.com/primo-ppcg/janet/commit/51166ece48f6bff1288b995c2a6507936a69a1a9