Closed vruusmann closed 8 years ago
This is strange, but easy enough to fix. I have no clue what the "joblib dump file" is you're referencing. It's not a pickle (python fails to unpickle it) and 'file' doesn't recognize it either. ?
This is Scikit-Learn's recommended model serialization approach:
from sklearn.externals import joblib
joblib.dump(obj, "XGBAuto.pkl", compress = 9)
Essentially, XGBAuto.pkl
is a GZip-compressed pickle file.
I still can't open the file, gzip doesn't like it and python's gzip.open also reports that it is not a gzipped file. :confused: Would have been nice to have a test file that I can use.
By the way what exactly do you mean by:
object instances (using Python 3.4) where the builtins.bytearray is already a byte array.
builtins.bytearray is the python builtin bytearray type, so I don't quite get what you are saying here?
Sorry, my bad - it appears to be a Zlib compressed pickle file, with a custom header (hence it's not recognized by the file
tool).
Attached is a regular pickle file XGBAuto.pkl
(you need to remove the .zip
filename extension manually - GitHub wouldn't let me to upload it without a "supported" file extension). This file was created using the following code (Python 3.4):
import pickle
con = open("XGBAuto.pkl", "wb")
pickle.dump(obj, con, protocol = -1)
con.close()
When I try to unpickle it, then I get the following ClassCastException
:
Exception in thread "main" java.lang.ClassCastException: [B cannot be cast to java.util.ArrayList
at net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:24)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:213)
builtins.bytearray is the python builtin bytearray type, so I don't quite get what you are saying here?
Pyrolite 4.11 uses ByteArrayConstructor
for handling __builtin__.bytearray
, __builtin__.bytes
, builtins.bytearray
and _codecs.encode
Python types. In my XGBAuto
case, I had to override the mapping for the builtins.bytearray
Python type, and not for others. Perhaps you know better if it will be necessary to override the mapping for other Python types as well.
Interesting fact - when I re-run exactly the same code using Python 2.7 (instead of Python 3.4), then the XGBAuto.pkl
file can be unpickled without problems.
Please see the attached pickle file XGBAuto-v27.pkl
(again, you need to remove the .zip
filename extension manually).
Thanks, that helped. I can now reproduce the problem using a tiny pickle file that I made myself in Python 3.5. Somehow this particular pickling of bytearrays went unnoticed.
Can you let me know if this resolves your issue? I'll make an official new release if that is the case.
I can confirm that the current Pyrolite version 4.12-SNAPSHOT
works fine with my tests.
new versions have been relased to maven and nuget.
The method
ByteArrayConstructor#construct(Object[])
fails with aClassCastException
if the argument is already a byte array. This situation should be handled by introducing aninstanceof
type check:I've come across pickled
xgboost.sklearn.XGBRegressor
object instances (using Python 3.4) where thebuiltins.bytearray
is already a byte array. For example, the following commit contains such joblib dump fileXGBAuto.pkl
: https://github.com/jpmml/jpmml-sklearn/commit/bc022a3c6353a4eec5f377e6937237c73334c445