irmen / Pyrolite

Java and .NET client interface for Pyro5 protocol
MIT License
178 stars 47 forks source link

java: ByteArrayConstructor should graciously handle existing byte arrays #36

Closed vruusmann closed 8 years ago

vruusmann commented 8 years ago

The method ByteArrayConstructor#construct(Object[]) fails with a ClassCastException if the argument is already a byte array. This situation should be handled by introducing an instanceof type check:

if(args.length == 1){
  if(args[0] instanceof byte[]){
    return args[0];
  }
  // Proceed as usual
}

I've come across pickled xgboost.sklearn.XGBRegressor object instances (using Python 3.4) where the builtins.bytearray is already a byte array. For example, the following commit contains such joblib dump file XGBAuto.pkl: https://github.com/jpmml/jpmml-sklearn/commit/bc022a3c6353a4eec5f377e6937237c73334c445

irmen commented 8 years ago

This is strange, but easy enough to fix. I have no clue what the "joblib dump file" is you're referencing. It's not a pickle (python fails to unpickle it) and 'file' doesn't recognize it either. ?

vruusmann commented 8 years ago

This is Scikit-Learn's recommended model serialization approach:

from sklearn.externals import joblib

joblib.dump(obj, "XGBAuto.pkl", compress = 9)

Essentially, XGBAuto.pkl is a GZip-compressed pickle file.

irmen commented 8 years ago

I still can't open the file, gzip doesn't like it and python's gzip.open also reports that it is not a gzipped file. :confused: Would have been nice to have a test file that I can use.

By the way what exactly do you mean by:

object instances (using Python 3.4) where the builtins.bytearray is already a byte array.

builtins.bytearray is the python builtin bytearray type, so I don't quite get what you are saying here?

vruusmann commented 8 years ago

Sorry, my bad - it appears to be a Zlib compressed pickle file, with a custom header (hence it's not recognized by the file tool).

Attached is a regular pickle file XGBAuto.pkl (you need to remove the .zip filename extension manually - GitHub wouldn't let me to upload it without a "supported" file extension). This file was created using the following code (Python 3.4):

import pickle

con = open("XGBAuto.pkl", "wb")
pickle.dump(obj, con, protocol = -1)
con.close()

When I try to unpickle it, then I get the following ClassCastException:

Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.util.ArrayList
        at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:24)
        at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:213)

builtins.bytearray is the python builtin bytearray type, so I don't quite get what you are saying here?

Pyrolite 4.11 uses ByteArrayConstructor for handling __builtin__.bytearray, __builtin__.bytes, builtins.bytearray and _codecs.encode Python types. In my XGBAuto case, I had to override the mapping for the builtins.bytearray Python type, and not for others. Perhaps you know better if it will be necessary to override the mapping for other Python types as well.

XGBAuto.pkl.zip

vruusmann commented 8 years ago

Interesting fact - when I re-run exactly the same code using Python 2.7 (instead of Python 3.4), then the XGBAuto.pkl file can be unpickled without problems.

Please see the attached pickle file XGBAuto-v27.pkl (again, you need to remove the .zip filename extension manually).

XGBAuto-v27.pkl.zip

irmen commented 8 years ago

Thanks, that helped. I can now reproduce the problem using a tiny pickle file that I made myself in Python 3.5. Somehow this particular pickling of bytearrays went unnoticed.

irmen commented 8 years ago

Can you let me know if this resolves your issue? I'll make an official new release if that is the case.

vruusmann commented 8 years ago

I can confirm that the current Pyrolite version 4.12-SNAPSHOT works fine with my tests.

irmen commented 8 years ago

new versions have been relased to maven and nuget.