vigna / fastutil

fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues.
Apache License 2.0
1.81k stars 199 forks source link

Object to primitive array based maps with custom indexer #254

Closed Marcono1234 closed 2 years ago

Marcono1234 commented 3 years ago

What do you think about Object2PrimitiveMap classes which are array based and directly allow the user to specify a mapping function / an indexer to calculate key indices. The use case for such a map would be:

This might sound somewhat similar to the existing ...CustomHashMap classes with the difference that it can be implemented using a single primitive array for value storage and a bit vector for recording whether an entry is present compared to the two arrays of the hash map (and excess space due to load factor).

Such a map would be well suited for enums (their ordinal() value can be used as index) (see also #148), but also easily allows subset of enum constants or other custom singleton data types. In my concrete use case I implemented something similar to this to create maps of Character.UnicodeScript, which has 157 constants, but where the application only uses 18 of them.

The 'indexer' used to map key to index can be shared by multiple maps, this avoids unnecessary enum values() array copies for each map instance.

However, I am also mainly interested in your feedback regarding this idea. I am not really familiar with efficient primitive data structures and am therefore not sure how good performance of such a data structure would be or how well the JVM would handle it, you probably have more experience in this area.

Below is a sample implementation, note however that I am not planning to submit a pull request: (not fully tested, but hopefully you get the idea)

/**
 * Converts map keys to internal indices and vice versa. Indexers should be thread-safe
 * and can be shared by multiple map instances.
 */
interface KeyIndexer<K> {
    /** Returns the number of indices this indexer uses. */
    int indicesCount();

    /**
    * Returns an index for the key in range 0..({@link #indicesCount()} - 1). May return
    * {@value #NO_INDEX} if the key is not supported.
    */
    int keyToIndex(K key);

    /**
    * Returns the key for an index. This is the reverse function of {@link #keyToIndex(K)}.
    * Behavior is undefined when an index not supported by this indexer is provided.
    */
    K indexToKey(int index);

    int NO_INDEX = -1;

    /** Creates an indexer for a subset of all enum constants. */
    static <E extends Enum<E>> KeyIndexer<E> fromEnumConstants(Set<E> constants) {
       if (constants.isEmpty()) {
          throw new IllegalArgumentException("Constants must not be empty");
       }
       if (constants.contains(null)) {
          throw new IllegalArgumentException("Constants must not contain null");
       }

       Class<E> enumClass = constants.iterator().next().getDeclaringClass();
       E[] allConstants = enumClass.getEnumConstants();
       if (allConstants.length == constants.size()) return forAllEnumConstants(enumClass);

       int[] ordinalToIndex = new int[allConstants.length];
       Arrays.fill(ordinalToIndex, NO_INDEX);

       int index = 0;
       for (E constant : constants) {
          ordinalToIndex[constant.ordinal()] = index;
          index++;
       }

       Object[] indexToConstant = new Object[index];
       for (E constant : constants) {
          indexToConstant[ordinalToIndex[constant.ordinal()]] = constant;
       }

       final int indicesCount = index;
       return new KeyIndexer<E>() {
          @Override
          public int indicesCount() {
             return indicesCount;
          }

          @Override
          public int keyToIndex(E key) {
             return ordinalToIndex[key.ordinal()];
          }

          @SuppressWarnings("unchecked")
          @Override
          public E indexToKey(int index) {
             return (E) indexToConstant[index];
          }
       };
    }

    static <E extends Enum<E>> KeyIndexer<E> forAllEnumConstants(Class<E> enumClass) {
       E[] enumConstants = enumClass.getEnumConstants();
       return new KeyIndexer<E>() {
          @Override
          public int indicesCount() {
             return enumConstants.length;
          }

          @Override
          public int keyToIndex(E key) {
             return key.ordinal();
          }

          @Override
          public E indexToKey(int index) {
             return enumConstants[index];
          }
       };
    }
}

/* Sample Map implementation */
class Object2IntIndexedArrayMap<K> {
    private final KeyIndexer<K> keyIndexer;
    private final long[] entryBitSet;
    private final int[] values;
    private int size = 0;

    public Object2IntIndexedArrayMap(KeyIndexer<K> keyIndexer) {
        this.keyIndexer = keyIndexer;
        values = new int[keyIndexer.indicesCount()];
        entryBitSet = new long[1 + (values.length / Long.SIZE)];
    }

    private boolean hasEntry(long section, int index) {
        return (section & (1L << (index % Long.SIZE))) != 0L;
    }

    private boolean hasValue(int index) {
        return hasEntry(entryBitSet[index / Long.SIZE], index);
    }

    public boolean containsKey(K key) {
        return hasValue(keyIndexer.keyToIndex(key));
    }

    public int get(K key) {
        // Value or default 0
        return values[keyIndexer.keyToIndex(key)];
    }

    public int getOrDefault(K key, int defaultValue) {
        int index = keyIndexer.keyToIndex(key);
        if (hasValue(index)) {
           return values[index];
        } else {
           return defaultValue;
        }
    }

    public int putInt(K key, int value) {
        int index = keyIndexer.keyToIndex(key);
        int old = values[index];
        values[index] = value;
        long section = entryBitSet[index / Long.SIZE];
        if (!hasEntry(section, index)) {
            size++;
        }
        entryBitSet[index / Long.SIZE] = section | (1L << (index % Long.SIZE));

        return old;
    }

    public int remove(K key) {
        int index = keyIndexer.keyToIndex(key);
        int value = values[index];
        values[index] = 0; // Set to 0 since other methods rely on this default value
        long section = entryBitSet[index / Long.SIZE];
        if (hasEntry(section, index)) {
            size--;
        }
        entryBitSet[index / Long.SIZE] = section & ~(1L << (index % Long.SIZE));

        return value;
    }

    public void clear() {
        Arrays.fill(entryBitSet, 0L);
        Arrays.fill(values, 0);
        size = 0;
    }

    public int size() {
        return size;
    }
}

Usage example:

enum MyEnum {
    E0, E1, E2, E3, E4, E5, E6, E7, E8, E9, E10, E11, E12, E13, E14, E15,
    E16, E17, E18, E19, E20, E21, E22, E23, E24, E25, E26, E27, E28, E29, E30, E31,
    E32, E33, E34, E35, E36, E37, E38, E39, E40, E41, E42, E43, E44, E45, E46, E47,
    E48, E49, E50, E51, E52, E53, E54, E55, E56, E57, E58, E59, E60, E61, E62, E63,
    E64, E65, E66, E67, E68, E69, E70, E71, E72, E73, E74, E75, E76, E77, E78, E79,
    E80, E81, E82, E83, E84, E85, E86, E87, E88, E89, E90, E91, E92, E93, E94, E95,
    E96, E97, E98, E99, E100, E101, E102, E103, E104, E105, E106, E107, E108, E109, E110, E111,
    E112, E113, E114, E115, E116, E117, E118, E119, E120, E121, E122, E123, E124, E125, E126, E127,
    E128, E129, E130, E131, E132, E133, E134, E135, E136, E137, E138, E139, E140, E141, E142, E143,
    E144, E145, E146, E147, E148, E149, E150, E151, E152, E153, E154, E155, E156, E157, E158, E159,
    E160, E161, E162, E163, E164, E165, E166, E167, E168, E169, E170, E171, E172, E173, E174, E175,
    E176, E177, E178, E179, E180, E181, E182, E183, E184, E185, E186, E187, E188, E189, E190, E191,
    E192, E193, E194, E195, E196, E197, E198, E199, E200, E201, E202, E203, E204, E205, E206, E207,
    E208, E209, E210, E211, E212, E213, E214, E215, E216, E217, E218, E219, E220, E221, E222, E223,
    E224, E225, E226, E227, E228, E229, E230, E231, E232, E233, E234, E235, E236, E237, E238, E239,
    E240, E241, E242, E243, E244, E245, E246, E247, E248, E249, E250, E251, E252, E253, E254, E255;

    public static void main(String[] args) {
        Object2IntIndexedArrayMap<MyEnum> map = new Object2IntIndexedArrayMap<>(KeyIndexer.forAllEnumConstants(MyEnum.class));
        for (MyEnum e : values()) {
            map.putInt(e, e.ordinal() * 2);
        }

        for (MyEnum e : values()) {
            boolean contains = map.containsKey(e);
            int get = map.get(e);
            int removed = map.remove(e);
            boolean containsRemoved = map.containsKey(e);
            int size = map.size();

            if (!contains || containsRemoved || get != removed || get != e.ordinal() * 2 || size != 255 - e.ordinal()) {
                throw new AssertionError("Implementation error");
            }
        }
    }
}
incaseoftrouble commented 3 years ago

I think this use case is quite special, the efficiency of this relies on several assumptions: You both need to find a mapping of your real domain into [0, n] (where n is reasonably small, but not too small, otherwise enum maps etc. are more efficient) and that [0, n] should be filled quite densely (otherwise a int2int hash map would be more efficient).

I can see the usecase for mapping e.g. an alphabet (on the other hand, not sure if you would need that there, since you could directly map it with an array). However, I think this is too specialized to be contained in a general purpose library, in particular the implementation simply is not useable at all for anything else - its not that it would be slow, it would simply break (this does not mean that it is a bad thing to do, its simply a specialized implementation!). So you can't really use it as a generic implementation of a Map. And, if I understand correctly, the main point of the implementation is to obtain space savings, so this would likely only pay off if you have thousands of these maps, all satisfying the assumptions (if the data is sparse, you could use hash maps, if the data is completely dense, you could use an array).

One intermediate solution which is a bit less efficient given the assumptions but works in general would maybe be to implement an X2YMap which takes X2ZFunction (together with an inverse) and Z2YMap and transparently remaps keys. Combine this with my Nat2IntArrayMap and you pretty much get your behaviour (except bit set compression but no second array at all).

BTW: java.util.Bitset could be helpful for implementation, especially for writing iterators :-)

Marcono1234 commented 3 years ago

Thanks for your extensive feedback!

otherwise enum maps etc. are more efficient

The proposed implementation could effectively act as enum map, its implemention is pretty similar to the standard Java EnumMap which also uses a single pre-sized array for the values; both have O(1) lookup time. The only overhead the proposed solution adds comes from the indirection through the KeyIndexer, but maybe that is negligible (has to be tested).

since you could directly map it with an array

Manually maintaining an array would work, but you would have to repeat the boilerplate code for conversion from key to index and vice versa. Additionally you would have to write all methods (such as containsKey) yourself.

it would simply break

You are right, that is the main issue. If the map is created for enum constants [A, B, C] and you try to put a value for D it would fail.

BTW: java.util.Bitset could be helpful for implementation, especially for writing iterators :-)

BitSet contains logic for resizing which is not necessary here, therefore I went for a manual bit set. But that is probably premature optimization; using BitSet would work here as well and would indeed simplify some tasks.

its simply a specialized implementation

I think this summarizes it pretty well. For my use case I even went for a more specialized (and simplified) implementation similar to your Nat2IntArrayMap, but I thought a more general solution (as proposed here) might be helpful.

If you don't mind, I would leave this issue open to see if anyone else is interested in this. But I understand that such a map might be too specific to be provided by such a general purpose library.

incaseoftrouble commented 3 years ago

Thanks for your extensive feedback!

Thanks for the extensive suggestion ;)

The proposed implementation could effectively act as enum map, its implemention is pretty similar to the standard Java EnumMap which also uses a single pre-sized array for the values; both have O(1) lookup time. The only overhead the proposed solution adds comes from the indirection through the KeyIndexer, but maybe that is negligible (has to be tested).

Indeed, the difference is negligible. I meant that in case the data isn't sparse (e.g. you only consider a subset of the complete enum set), you'd be saving some space (since the value array isn't as large). But this difference only would matter if data is really sparse and you'd have a ton of these maps. Conclusion: The difference is small so there isn't a need except for this (very specific) case.

Manually maintaining an array would work, but you would have to repeat the boilerplate code for conversion from key to index and vice versa. Additionally you would have to write all methods (such as containsKey) yourself.

I meant having an implementation such as the linked Nat2IntArrayMap (compared to having a generic key to index mapping).

My point is (summarizing the above two replies) that the use cases where having an index mapping + bit set is superior to both (i) hashing and (ii) directly using the mapped index as key is (I think) very rare.

Correct me if I am wrong: As I see it, the efficiency of this approach rests upon the assumption that the set of keys (or rather: their mapped indices) your application encounters is (i) significantly smaller than the domain of the keys and (ii) known a-priori. If (i) is not true, directly mapping in an array would be faster, if (ii) is not true, a hash-based approach is faster.

BitSet contains logic for resizing which is not necessary here, therefore I went for a manual bit set. But that is probably premature optimization; using BitSet would work here as well and would indeed simplify some tasks.

Agree with both points. What I was aiming at is that BitSet has a non-trivial nextSetBit function, which would be useful for iterators.

but I thought a more general solution (as proposed here) might be helpful.

I see your point, I just do not think there is a common application where this might be the canonical solution. Generality often comes at a cost, even if it is hidden sometimes :-)

But definitely, I agree, if more people are interested my main concern directly is invalidated and it might make sense to include this.