mattbierner / hamt

Javascript Hash Array Mapped Trie
MIT License

Suggestion - bring back setHash, removeHash and add getNode #11

Closed: MichaelOstermann closed this 8 years ago

MichaelOstermann commented 8 years ago

This is mainly performance oriented: by using setHash and removeHash you avoid unnecessary hamt.hash recalculations.

The idea of getNode is the same as get, but it returns the actual node instead of the value, so you can read its hash and reuse it in subsequent setHash or removeHash calls.
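For illustration, here is roughly how the pieces could fit together. This is only a sketch of the proposal: getNode, node.hash, and the exact setHash/removeHash signatures shown here are assumptions, not the shipped API.

var hamt = require('hamt');

// setHash/removeHash take a precomputed hash, so hamt.hash runs once.
var hash = hamt.hash('key');
var map = hamt.setHash(hash, 'key', 1, hamt.empty);

// getNode would return the stored node itself rather than its value,
// exposing the hash it already carries.
var node = hamt.getNode('key', map);          // proposed addition
map = hamt.setHash(node.hash, 'key', 2, map); // no rehashing of 'key'
map = hamt.removeHash(node.hash, 'key', map); // likewise for removal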

In my particular use case, I'm building an abstraction on top of HAMT that converts raw objects to a trie and provides additional pure update methods such as merge, deepMerge, filter, and concat.

In the case of deepMerge, for example, I have to recursively iterate through all of the target node's descendants and either update their values or remove them altogether, so a single deepMerge call can end up issuing a lot of hamt.set or hamt.remove calls, each one recalculating the hash of its key. I've made a small modification to HAMT (v0.1.9) where hamt.get returns the actual nodes with their already-calculated hashes, and I'm seeing a visible performance increase from this (in the context of millions of calls).
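A minimal sketch of that per-entry pattern, assuming the proposed getNode and a node that carries its hash (updateEntry and the undefined-means-remove convention are made up for illustration):

function updateEntry(map, key, value) {
    // One hash computation per key instead of one per set/remove call.
    var node = hamt.getNode(key, map); // proposed addition
    return value === undefined
        ? hamt.removeHash(node.hash, key, map)
        : hamt.setHash(node.hash, key, value, map);
}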

What do you think?

mattbierner commented 8 years ago

I agree. I removed them in V1 to be conservative and to better support some potential internal optimizations (such as #10, where you need to be able to recover the hash from just a key), but I think having these low-level methods is one of the main appeals of HAMT over alternative implementations.

One other point of interest is that the Immutable library memoizes its hash computations to avoid having to perform them over and over again.

On track for the V1.1 update.

MichaelOstermann commented 8 years ago

I would be cautious about memoizing; this would be a sample implementation:

Function.prototype.memoize = function () {
    var self = this, cache = {};
    return function (arg) {
        // Serve repeated arguments straight from the cache.
        if (arg in cache) {
            return cache[arg];
        }
        // First call for this argument: compute, store, and return.
        return cache[arg] = self(arg);
    };
};

The main problem here is that there is a single cache object. One of the advantages of HAMT is that it scales well with big datasets, since the trie is split up into small buckets instead of one large array. By comparison, lookups in the cache object will degrade as it grows, especially since every call performs two extra lookups: arg in cache and cache[arg].
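For what it's worth, a common variant trades one of those lookups away, at the cost of assuming the memoized function never returns undefined. This is just a sketch of that trade-off, not a recommendation:

Function.prototype.memoize = function () {
    var self = this, cache = Object.create(null); // no inherited keys
    return function (arg) {
        var hit = cache[arg];
        // Single lookup on a hit; assumes self(arg) is never undefined.
        return hit !== undefined ? hit : (cache[arg] = self(arg));
    };
};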

mattbierner commented 8 years ago

Restored in V1.1.