vigna / Sux4J

Sux4J is an effort to bring succinct data structures to Java.
GNU Lesser General Public License v2.1
154 stars 22 forks source link

GOVMinimalPerfectHashFunction, optimize the call to getLong #2

Closed okrische closed 7 years ago

okrische commented 8 years ago

I use the hash function to create unique hashes for millions of unique keys. Each call to getLong() creates int[3] and long[3], which stresses the GC and could be avoided, imho.

So i can use getLongByTriple(long[]), then i have at least the long[3] gone. But it still requires me to subclass the function, so that i get access to the transforming strategy and globalSeed.

And it could be even nicer to have access to getLongByTripleNoCheck(long[], int[]), since i know the keys and do not need the signature check either. But it is a private method.

So what could be an alternative?

I understand, that i need to reset int[] and long[] to 0 with each call. Also you probably want to hide this kind of implementation detail.

Maybe one can provide another structure, like a buffer, that has int[] and long[], but is only visible to the function. And one only has to make sure, that this buffer is not shared within different threads.

GOVBuffer buffer = function.createBuffer();
for(String key : keys) {
   long hash = function.getLong(key, buffer);
}

public class GOVMinimalPerfectHashFunction  ... {
  ...
  public long getLong(T key, GOVBuffer buff) {
       ...
  }
  public static class GOVBuffer {
      private final int[] e = new int[3];
      private final long[] h = new int[3];
      private void reset() {
             int[0] = 0;
             ...
      }
  }
}
vigna commented 8 years ago

No, you don't need to reset the arrays—they will be filled with the relevant data.

I know, the creation of those two arrays is a nuisance. One thing you could do is to have your own subclass that contains those two buffers as attributes. Of course, at that point the class is no longer thread safe, but you can implement a copy() method (as in FlyweightPrototype) that copies by reference the static data and creates a new instance with different buffers.

I think this approach would be less cumbersome for the caller than having to specify each time a new buffer.

okrische commented 8 years ago

Why do i have to specifiy each time a new buffer? It is just for loops. You get once the buffer before the loop and then you use it again and again within the loop.

I thought about sub-classing. Would it really help me? I have no access to the needed private variables. So i can not copy the code. Plus its work to align it with the latest developments. Also i share the hash function in different threads, they would all use their own buffer and i do not want to clone the hash function to keep it thread-safe (its already several megabytes)

So i still think, its the best to either have either

(Now i use even reflection to get access to this little private method getLongByTripleNoCheck)

vigna commented 8 years ago

For each loop, you'd have to specify a new buffer. It would be quite combersome.

I'd rather have a method that takes two support arrays. Creating an entire data structure seems overkill. Would you like that?

okrische commented 8 years ago

I agree, for each loop i have to do that. But for the method, that takes the support arrays, the same argument is valid. I additionally have to know the minimum or maximum length of the support arrays. While with a special support buffer i do not need to know this kind of detail. Its not so easy to change that detail afterwards anymore, like, when you need an array of at least 4 or 5 elements.

In any way, I would be happy to have it either way! Would speed things up a lot for me. Thank you for considering!