Closed pkolaczk closed 9 years ago
The bug is on the writing (dumping) side, because I can see the Set appear incorrectly as a list on the Python side as well.
Try setting Config.SERPENT_SET_LITERALS = true. It defaults to false to stay compatible with Python < 3.2, where set literals are not understood by the ast module; with the setting off, sets are serialized as tuples instead. See https://github.com/irmen/Serpent/blob/master/java/src/net/razorvine/serpent/Serializer.java#L231
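A quick Python-side illustration of why the setting exists (this uses only the stdlib ast module, not Serpent itself): ast.literal_eval only accepts set literals from Python 3.2 on, so on older interpreters the safe fallback is a tuple.

```python
import ast

# On Python 3.2+, a set literal parses back into a real set.
# On Python < 3.2 this raises an error, which is why Serpent
# falls back to tuples when SERPENT_SET_LITERALS is off.
parsed = ast.literal_eval("{'value1', 'value2'}")
print(type(parsed).__name__)   # set

# With the setting off, the same data arrives as a tuple instead:
parsed_tuple = ast.literal_eval("('value1', 'value2')")
print(type(parsed_tuple).__name__)  # tuple
```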
Is it going to work with Python 2.7?
No. To use set literals you will have to use Python 3.2 or newer. This is a limitation of Python's own ast module, hence the configuration setting to enable/disable this in Serpent.
Wait, I don't understand something. In Python 2.7, sets seem to be supported, because the first set gets converted properly and converts back to either a Python set or a Java HashSet. So it is possible. Yet the second and all subsequent sets in the complex object are not. How does it know the first set should be unpickled to a set and not to an array or a list? The type information is definitely there.
Can you tell us why the first one gets converted ok and the rest do not, and why can't it be fixed in Python 2.7?
Right, if one set is converted and the other not, there's definitely something strange going on.
In any case, what happens when you set Config.SERPENT_SET_LITERALS = true ?
Also, pyrolite/serpent have no built-in support for Guava types; what happens when you replace the Guava set type with a regular JDK set?
Finally, as I don't have a Scala compiler installed, could you please convert the code to a Java equivalent so that I can easily reproduce the issue?
This is a regular JDK HashSet; Guava is used here only for its convenience factory method. See the docs: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Sets.html#newHashSet(E...)
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
import net.razorvine.pickle.Pickler;
import net.razorvine.pickle.Unpickler;

import java.io.IOException;
import java.util.*;

class PickleTest {
    public static void main(String[] args) throws IOException {
        Set<String> set1 = Sets.newHashSet("value1", "value2");
        Map<String, Object> map1 = Maps.newHashMap();
        map1.put("pkey", 1);
        map1.put("s", set1);

        Set<String> set2 = Sets.newHashSet("value1", "value2");
        Map<String, Object> map2 = Maps.newHashMap();
        map2.put("pkey", 2);
        map2.put("s", set2);

        Object obj = new Object[]{ map1, map2 };

        Pickler pickler = new Pickler();
        byte[] array = pickler.dumps(obj);

        Unpickler unpickler = new Unpickler();
        Object[] result = (Object[]) unpickler.loads(array);

        System.out.println(((HashMap<String, Object>) result[0]).get("s").getClass());
        System.out.println(((HashMap<String, Object>) result[1]).get("s").getClass());
    }
}
Right, thanks for that code. I can now reproduce the problem. Earlier, I didn't look too carefully at it and assumed it had something to do with the serpent serializer, which is the default, but you are using pickle explicitly.
The problem is in the memoization code. If you construct a pickler without memoization, the result is correct; with memoization (on by default) it somehow produces a faulty pickle. I wrote the two resulting pickles to a file and loaded them in Python itself:
>>> import pickle
>>> pickle.load(open("withoutmemo.pickle","rb"))
({'s': {'value1', 'value2'}, 'pkey': 1}, {'s': {'value1', 'value2'}, 'pkey': 2})
>>> pickle.load(open("withmemo.pickle","rb"))
({'s': {'value1', 'value2'}, 'pkey': 1}, {'s': ['value2', 'value1'], 'pkey': 2})
I will investigate what goes wrong in the memoization code.
In the meantime, use: new Pickler(false) to switch off the memoization.
The example code can be simplified even more, pickling the following already replicates the problem:
Set<String> set1 = new HashSet<>();
Set<String> set2 = new HashSet<>();
Object obj = new Object[] { set1, set2 };
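For reference, the equivalent repro against CPython's stdlib pickle behaves correctly: because its memo is keyed on object identity rather than equality, two distinct but equal sets are pickled as two separate sets and both deserialize with the right type.

```python
import pickle

set1 = {'value1', 'value2'}
set2 = {'value1', 'value2'}   # equal to set1, but a distinct object

r1, r2 = pickle.loads(pickle.dumps((set1, set2)))
print(type(r1).__name__, type(r2).__name__)   # set set
print(r1 is r2)   # False: two separate set objects survive the round trip
```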
Thanks, that's a useful workaround. Does turning off memoization have severe performance impact?
That totally depends on the structure of the data. If you have a lot of objects that reoccur multiple times in the data you're serializing, then yes: it will cause a lot of object duplication on the deserialization side.
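The trade-off can be sketched with the stdlib pickler, which has a comparable switch: its (deprecated) "fast mode" disables the memo, much like new Pickler(false) does in Pyrolite. With the memo off, a shared object is serialized once per reference and comes back as duplicates:

```python
import io
import pickle

shared = ['value1', 'value2']

# Default: memoization on, so sharing is preserved.
a, b = pickle.loads(pickle.dumps((shared, shared)))
print(a is b)   # True

# "Fast mode" disables the memo (deprecated, but handy for illustration).
buf = io.BytesIO()
p = pickle.Pickler(buf)
p.fast = True
p.dump((shared, shared))
c, d = pickle.loads(buf.getvalue())
print(c is d)   # False: the shared list was written twice and duplicated
```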
I'm fixing the issue as we speak though, so you won't have to use the workaround for long. Need to translate the fix to the C# implementation still.