jerrinot / subzero

SubZero - Fast Serialization for Hazelcast
Apache License 2.0

SubZero across different clusters #31

Open PotatoSpud opened 5 years ago

PotatoSpud commented 5 years ago

Using SubZero across different Hazelcast clusters is problematic: the serialization IDs assigned on one cluster are not consistent with those assigned on another. I have worked around this by creating my own KryoStrategy implementations that use an ID generated from the fully qualified class name.
It gets tricky for two reasons though:

  1. Generic (templated) classes are a challenge, but there is a way to handle them.
  2. The generated IDs must avoid those that Hazelcast uses internally.

I can attempt a fork if there is interest in this solution.

Best regards, Aongus

jerrinot commented 5 years ago

Hi, what's your strategy for generating unique IDs from class names?

PotatoSpud commented 5 years ago

This is not perfect, as the hashes may not be completely unique. However, the chances of a clash are very low. I added new versions of TypedKryoStrategy and GlobalKryoStrategy as follows:

public class IndigoTypedKryoStrategy<T> extends KryoStrategy<T> {

    private final Class<T>       clazz;
    private final UserSerializer userSerializer;

    public IndigoTypedKryoStrategy(final Class<T> clazz, final UserSerializer registrations) {
        this.clazz = clazz;
        this.userSerializer = registrations;
    }

    @Override
    public void registerCustomSerializers(final Kryo kryo) {
        this.userSerializer.registerSingleSerializer(kryo, this.clazz);
    }

    @Override
    void writeObject(final Kryo kryo, final Output output, final T object) {
        kryo.writeObject(output, object);
    }

    @Override
    T readObject(final Kryo kryo, final Input input) {
        return kryo.readObject(input, this.clazz);
    }

    @Override
    public int newId() {
        return HashUtil.serializionIdHash(this.clazz.getName());
    }
}

public class IndigoGlobalKryoStrategy<T> extends KryoStrategy<T> {
    private final UserSerializer userSerializer;
    private final int            id;
    private static final String  GLOBAL = "global";

    public IndigoGlobalKryoStrategy(final UserSerializer registrations) {
        this.userSerializer = registrations;
        String identifier = GLOBAL;
        try {
            final Type sooper = this.getClass().getGenericSuperclass();
            final Type t = ((ParameterizedType) sooper).getActualTypeArguments()[0];
            identifier = t.getTypeName();
        } catch (final Exception e) {
            // Fall through and keep the "global" identifier.
        }
        this.id = HashUtil.serializionIdHash(identifier);
    }

    @Override
    public void registerCustomSerializers(final Kryo kryo) {
        this.userSerializer.registerAllSerializers(kryo);
    }

    @Override
    void writeObject(final Kryo kryo, final Output output, final T object) {
        kryo.writeClassAndObject(output, object);
    }

    @SuppressWarnings("unchecked")
    @Override
    T readObject(final Kryo kryo, final Input input) {
        return (T) kryo.readClassAndObject(input);
    }

    @Override
    public int newId() {
        return this.id;
    }
}
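
The generic-superclass lookup in the constructor above only sees a concrete type when the strategy is subclassed with a concrete type argument. A minimal JDK-only sketch of that behaviour follows; `TypeNameProbe`, `StringProbe` and `GenericProbe` are hypothetical names used for illustration and are not part of SubZero.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical stand-in for a generic strategy base class; not part of SubZero.
abstract class TypeNameProbe<T> {
    // Returns the name of the first type argument recorded on the direct superclass.
    String resolvedTypeName() {
        Type superType = getClass().getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            return ((ParameterizedType) superType).getActualTypeArguments()[0].getTypeName();
        }
        return "global"; // raw subclass: no type argument was recorded
    }
}

// Concrete type argument: resolvedTypeName() returns "java.lang.String".
class StringProbe extends TypeNameProbe<String> { }

// Still generic: only the type variable is visible, so resolvedTypeName() returns "E".
class GenericProbe<E> extends TypeNameProbe<E> { }
```

Because of type erasure, `new GenericProbe<Integer>()` cannot recover `Integer`; only a named or anonymous subclass such as `new TypeNameProbe<Long>() { }` carries the concrete argument.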

The MurmurHash3_x86_32 algorithm was lifted from Hazelcast itself, but any decent hash would do the job:

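import java.nio.charset.StandardCharsets;

public class HashUtil {
    // MurmurHash3_x86_32 (copied from Hazelcast) is omitted here.
    public static int serializionIdHash(final String text) {
        // Use a fixed charset so that every JVM hashes the same bytes.
        final byte[] bytes = text.getBytes(StandardCharsets.UTF_8);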
        int hash = HashUtil.MurmurHash3_x86_32(bytes, 0, bytes.length);
        // Avoid Hazelcast's internal registrations and our own space
        if ((hash > -400) && (hash < 100)) {
            hash += 500;
        }
        return hash;
    }
}
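
Since the MurmurHash3 body is not shown, here is a self-contained variant of the same idea that uses the JDK's CRC32 instead. It is a sketch under assumptions: `ClassNameIds` and `FIRST_FREE_ID` are hypothetical, the reserved bound of 1,000 is made up, and mapping into a positive range assumes that custom serializer IDs should be positive.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper; not part of SubZero or Hazelcast.
final class ClassNameIds {
    // Assumption: IDs below this bound are reserved (internal IDs plus a local block).
    static final int FIRST_FREE_ID = 1_000;

    // Maps a class name to a stable ID in [FIRST_FREE_ID, Integer.MAX_VALUE].
    static int idFor(String className) {
        CRC32 crc = new CRC32();
        // Fixed charset, so the same name yields the same bytes on every JVM.
        crc.update(className.getBytes(StandardCharsets.UTF_8));
        long unsigned = crc.getValue(); // 0 .. 2^32 - 1
        long span = (long) Integer.MAX_VALUE - FIRST_FREE_ID + 1; // number of usable IDs
        return (int) (FIRST_FREE_ID + unsigned % span);
    }
}
```

Folding the hash into a range keeps every result outside the reserved block without the special-case shift, at the cost of a slightly smaller ID space.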

Hope this helps, Aongus

jerrinot commented 5 years ago

@PotatoSpud: I am not crazy about the probabilistic nature of this. It smells like the birthday paradox to me: the chance of a conflict increases quite fast as the number of classes grows.
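
To put rough numbers on the birthday-paradox concern, the usual approximation p ≈ 1 - exp(-n(n-1) / (2 · 2^32)) can be evaluated directly. This is a sketch that assumes the hash is uniformly distributed over all 32-bit values; `CollisionOdds` is a hypothetical helper name.

```java
// Hypothetical helper for illustration only.
final class CollisionOdds {
    // Approximate probability that at least two of n class names share a 32-bit hash,
    // assuming uniformly distributed hashes.
    static double probability(long n) {
        double space = 4294967296.0;              // 2^32 possible hash values
        double pairs = (double) n * (n - 1) / 2.0; // number of distinct pairs of names
        return -Math.expm1(-pairs / space);        // 1 - exp(-pairs / space)
    }
}
```

Under that assumption the risk is roughly 0.01% for 1,000 classes and roughly 1.2% for 10,000, and it reaches about 50% near 77,000 classes.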

Is there any better way? Maybe a strategy with hard-coded IDs for well-known classes (think of JDK classes) and then a combination of:

  1. Explicit ID configuration for custom domain classes
  2. Using the fully qualified class name for classes without an explicit ID assignment

Any other idea?
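
The layered idea above could look something like the following sketch. Everything in it is hypothetical: `LayeredIdRegistry` does not exist in SubZero, the hard-coded IDs are made up, and CRC32 merely stands in for whatever hash is chosen for the fallback.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical registry sketching the layered lookup; not part of SubZero.
final class LayeredIdRegistry {
    private final Map<String, Integer> idByName = new HashMap<>();
    private final Map<Integer, String> nameById = new HashMap<>();

    LayeredIdRegistry() {
        // Layer 1: hard-coded IDs for well-known JDK classes (values are made up).
        assign("java.lang.String", 1_000);
        assign("java.util.ArrayList", 1_001);
    }

    // Layer 2: explicit configuration for domain classes. Rejects conflicting assignments.
    void assign(String className, int id) {
        Integer previous = idByName.get(className);
        String owner = nameById.get(id);
        if ((previous != null && previous != id) || (owner != null && !owner.equals(className))) {
            throw new IllegalStateException("Conflicting ID " + id + " for " + className);
        }
        idByName.put(className, id);
        nameById.put(id, className);
    }

    // Layer 3: fall back to a hash of the class name, failing fast on a clash.
    int idFor(String className) {
        Integer known = idByName.get(className);
        if (known != null) {
            return known;
        }
        CRC32 crc = new CRC32();
        crc.update(className.getBytes(StandardCharsets.UTF_8));
        // Hashed IDs start at 100,000 so that smaller values stay free for explicit use.
        int hashed = (int) (100_000 + crc.getValue() % 2_000_000_000L);
        assign(className, hashed);
        return hashed;
    }
}
```

The fail-fast check only catches clashes among classes a single JVM has seen, so two clusters with different class sets could still disagree without noticing; explicit IDs remain the only fully deterministic layer.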

PotatoSpud commented 5 years ago

@jerrinot: Agreed, my approach above is bound to create problems.

For your suggestions:

  1. This may help reduce the string length and significantly reduce the possibility of a clash. Is this what you had in mind?
  2. Not sure how this would work exactly. Conceptually, the fully qualified class name should be fine.

If you can get away from explicit ID (int) assignment, you are halfway home. I now understand more clearly how serialization works; it is deserialization that is confounding. I assume the class names must be presented again so that subsequent IDs can be understood.
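
One way to picture "presenting the class names again" is a per-stream name table: the writer sends a class name the first time it appears and a short ID afterwards, and the reader rebuilds the same table in the same order. The sketch below is purely illustrative, with hypothetical names (`NameTableWriter`, `NameTableReader`); it is not the wire format of Kryo or SubZero.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical writer: sends each class name once, then refers to it by a stream-local ID.
final class NameTableWriter {
    private final Map<String, Integer> ids = new HashMap<>();
    private final DataOutputStream out;

    NameTableWriter(DataOutputStream out) {
        this.out = out;
    }

    void writeClassRef(String className) throws IOException {
        Integer id = ids.get(className);
        if (id == null) {
            ids.put(className, ids.size()); // next free stream-local ID
            out.writeBoolean(true);         // flag: a full name follows
            out.writeUTF(className);
        } else {
            out.writeBoolean(false);        // flag: a known ID follows
            out.writeInt(id);
        }
    }
}

// Hypothetical reader: mirrors the writer, so no IDs need to be agreed on in advance.
final class NameTableReader {
    private final List<String> names = new ArrayList<>();
    private final DataInputStream in;

    NameTableReader(DataInputStream in) {
        this.in = in;
    }

    String readClassRef() throws IOException {
        if (in.readBoolean()) {
            String name = in.readUTF();
            names.add(name); // list index equals the ID the writer assigned
            return name;
        }
        return names.get(in.readInt());
    }
}
```

With such a scheme the IDs only have meaning inside one stream, so clusters never need to share an ID assignment; the price is the extra bytes for each first occurrence of a name.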