ninia / jep

Embed Python in Java

Auto Serialization of PyObject #464

Open prabhuk12 opened 1 year ago

prabhuk12 commented 1 year ago

I am a heavy user of JEP and thank you for everything this community does. I have been working with the 3.8 version of jep and recently upgraded to 4.1.1. In 4.1.1 all the collection constituent objects have been replaced with PyObject instead of the primitive java type like the int or string etc. As an example, df.dtypes.tolist() gives a response of ArrayList where the items are PyObject. If we try to access this in a different thread, that gives a thread access error. If we try to serialize it, it fails: the ArrayList is serializable, but the PyObject is not. My questions are:

a. Is there an easy way to specify to jep to break the PyObject bindings and bring it entirely into the JVM world without any JNI
b. Is there an automatic conversion I can tell JEP to say, get this pyobject "as" native instead of iterating through individual elements
c. Is there any method available in PyObject where it will automatically convert it into the appropriate java type ?

bsteffensmeier commented 1 year ago

In 4.1.1 all the collection constituent objects have been replaced with PyObject instead of the primitive java type like the int or string etc.

"Primitive" python types like long, float, boolean, string, list, map, buffer and None will be converted to the corresponding java types when the enclosing collection is converted. If there is a case where these types are being converted to a PyObject instead it is possibly a bug so please post a reproducible code snippet so we can determine if a fix is needed.

a. Is there an easy way to specify to jep to break the PyObject bindings and bring it entirely into the JVM world without any JNI

No. We cannot possibly have corresponding Java classes for every possible Python type, so you will need to convert your object into a type that does have a mapping to a Java type, either on the Python side by constructing Java objects or collections of primitives, or on the Java side by using PyObject to extract the values you need.
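
The Java-side route for the serialization problem can be sketched without jep at all. The example below is a minimal illustration, not jep code: `OpaqueHandle` is a hypothetical stand-in for a non-serializable `PyObject`, and `canSerialize` and `toStrings` are helper names made up for this sketch. It shows that a list holding such handles cannot be serialized, while a copy holding only their string forms can.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.List;

final class SerializableCopyDemo {

    // Hypothetical stand-in for a jep PyObject: a handle that is NOT Serializable.
    static final class OpaqueHandle {
        private final String text;

        OpaqueHandle(String text) {
            this.text = text;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    // Returns true if standard Java serialization of the value succeeds.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException.
            return false;
        }
    }

    // Copies the string form of each element into a new list of plain Strings.
    static List<String> toStrings(List<?> values) {
        List<String> copy = new ArrayList<>(values.size());
        for (Object value : values) {
            copy.add(String.valueOf(value));
        }
        return copy;
    }

    public static void main(String[] args) {
        List<Object> handles = new ArrayList<>();
        handles.add(new OpaqueHandle("float64"));
        handles.add(new OpaqueHandle("int32"));

        System.out.println(canSerialize(handles));            // list of handles
        System.out.println(canSerialize(toStrings(handles))); // list of Strings
    }
}
```

With real PyObjects, a copy like this would presumably have to be made on the thread that owns the interpreter, given the thread access error described in the question.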

b. Is there an automatic conversion I can tell JEP to say, get this pyobject "as" native instead of iterating through individual elements

There is not currently any way of controlling the behavior of jep conversion on items within a collection.

c. Is there any method available in PyObject where it will automatically convert it into the appropriate java type ?

We already do this for all the types with a straightforward mapping; any types that are not converted don't have a well-defined appropriate Java type. If there are types in Python that have an obvious 1-to-1 mapping with a Java class then we can consider adding automatic conversions.

prabhuk12 commented 1 year ago

Ben - Thank you so much for the immediate reply. Much appreciated. Assume df = pd.read_csv("csv_data")

Command: df.dtypes.tolist(). In 3.8, df.dtypes.tolist() gave an ArrayList of Strings. Now it gives an ArrayList of PyObjects. When you try to serialize it, this fails because PyObject is not serializable.

Also, what is an easy way to specify how to map Python to Java and vice versa? There is no clear FAQ or article I was able to find. Does this help clarify my question?

bsteffensmeier commented 1 year ago

In jep 3.x, when a Python object is converted to a Java object and no better conversion is found, it is converted to a java.lang.String. In jep 4.x this behavior has been changed so that the conversion of last resort is jep.python.PyObject. The reason for this change is that for most Python objects the string version is not very useful compared to the PyObject, which provides the ability to access attributes, pass it back into jep, or convert it to an appropriate Java object.

This table in the javadoc gives a full list of all the Python to Java conversions jep currently supports. Some jep functions that retrieve values from Python take a Class argument, and this can be used to control the conversion, but there is currently no way to get fine-grained control over objects within a collection.

It looks like you want to convert a list of pandas/numpy datatype objects into a java.util.List of strings. Jep considers a datatype a complex type, so it will be converted to a PyObject when it is in a Collection. You can control the conversion more if you can use an array instead of a List, for example: String[] dtypes = interp.getValue("df.dtypes.tolist()", String[].class);

prabhuk12 commented 1 year ago

Thank you again. I moved back to 3.9 since I wanted to get some of the pieces working again. I will get the latest jep and try this again. Instead of handling the specific case with getValue("expression", "type"), is there any way to tell jep to replicate the behaviour of 3.x? That is, if it does not find something that is mappable, serialize it and send it?

If I understand this right, I am going to force everything to string here, which is really not what I want to do. Instead I want to tell it, hey if you don't find a proper map only then move this guy to string instead of giving me a serialization issue. This is not desirable. If it only does string for PyObject, that is fantastic.

As a follow on.. in order to get the appropriate type otherwise, I have to inspect pyobject to see what type it is. There is no salience on the PyObject that tells me what type it is.

Does what I am asking make sense ?

bsteffensmeier commented 1 year ago

...is there any way to tell jep to replicate the behaviour of 3.x ?

No

If I understand this right, I am going to force everything to string here, which is really not what I want to do. Instead I want to tell it, hey if you don't find a proper map only then move this guy to string instead of giving me a serialization issue. This is not desirable. If it only does string for PyObject, that is fantastic.

You may need to make multiple calls between Python and java to extract the values you want as the types you need.

As a follow on.. in order to get the appropriate type otherwise, I have to inspect pyobject to see what type it is. There is no salience on the PyObject that tells me what type it is.

You can use any Python function from Java, so you can inspect it the same way in Java that you would in Python, using things like type() and isinstance().

Does what I am asking make sense ?

I understand that for your use case the default conversion behavior of Jep 3.x is preferred to that of Jep 4.x. However, Jep no longer supports String conversion as the default behavior. My experience with Jep has been that using String as the default conversion is only useful for the simplest objects, and it is confusing in real-world use cases involving complex objects.

I have found that most developers using Jep need to add logic before or after the Jep Java<->Python conversions to get the values in the format they need; Jep simply cannot properly handle every application-specific use case. My understanding of your problem is that you need to convert a Python list of datatypes to a Java list of Strings, and that seems like a typical use case where some conversion code is necessary. I have written some code below that provides a couple of different mechanisms for doing the conversions or inspecting the PyObjects; hopefully something in there helps you.

public static void main(String[] args) throws JepException {
    try (SharedInterpreter interpreter = new SharedInterpreter()) {
        // Setup a list of dtypes in python
        interpreter.exec("import numpy");
        interpreter.exec("dtypes = [numpy.dtype(numpy.float64), numpy.dtype(numpy.int32)]");

        // Convert Python List of dtypes into Java String[]
        String[] strDtypes = interpreter.getValue("dtypes", String[].class);
        System.out.println("Dtypes as String[] = " + Arrays.toString(strDtypes));

        // Convert Python List of dtypes into Java List of Strings
        List strDtypesList = interpreter.getValue("[str(t) for t in dtypes]", List.class);
        System.out.println("Dtypes as List<String> = " + strDtypesList);

        // Get some Python things in java to inspect each dtype
        PyCallable type = interpreter.getValue("type", PyCallable.class);
        PyCallable isinstance = interpreter.getValue("isinstance", PyCallable.class);
        PyObject dtype = interpreter.getValue("numpy.dtype", PyObject.class);

        // Get the dtypes as pyObject and inspect them
        List pyDtypesList = interpreter.getValue("dtypes", List.class);
        for(int i = 0 ; i < pyDtypesList.size() ; i += 1) {
            PyObject unknown = (PyObject) pyDtypesList.get(i);
            Boolean isDtype = isinstance.callAs(Boolean.class, unknown, dtype);
            System.out.println("isinstance(unknown" + i + ", dtype) = " + isDtype);
            PyObject pyType = type.callAs(PyObject.class, unknown);
            System.out.println("type(unknown" + i + ") = " + pyType);
            if( isDtype ) {
                System.out.println("unknown" + i + ".toString() = " + unknown.toString());
            }
        }
        // PyCallable gets a bit hard to follow, so use PyObject.proxy with a custom interface
        // to expose the functionality of a PyObject more clearly
        interpreter.exec("import builtins");
        PyBuiltins builtins = interpreter.getValue("builtins", PyObject.class).proxy(PyBuiltins.class);
        for(int i = 0 ; i < pyDtypesList.size() ; i += 1) {
            PyObject unknown = (PyObject) pyDtypesList.get(i);
            boolean isDtype = builtins.isinstance(unknown, dtype);
            System.out.println("isinstance(unknown" + i + ", dtype) = " + isDtype);
            PyObject pyType = builtins.type(unknown);
            System.out.println("type(unknown" + i + ") = " + pyType);
            if( isDtype ) {
                System.out.println("unknown" + i + ".toString() = " + unknown.toString());
            }
        }
    }
}

public static interface PyBuiltins {
    public boolean isinstance(PyObject obj, PyObject cls);

    public PyObject type(PyObject obj);
}

Running this results in the following output:

Dtypes as String[] = [float64, int32]
Dtypes as List<String> = [float64, int32]
isinstance(unknown0, dtype) = true
type(unknown0) = <class 'numpy.dtype[float64]'>
unknown0.toString() = float64
isinstance(unknown1, dtype) = true
type(unknown1) = <class 'numpy.dtype[int32]'>
unknown1.toString() = int32
isinstance(unknown0, dtype) = true
type(unknown0) = <class 'numpy.dtype[float64]'>
unknown0.toString() = float64
isinstance(unknown1, dtype) = true
type(unknown1) = <class 'numpy.dtype[int32]'>
unknown1.toString() = int32
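
For readers who want something close to the 3.x "string as last resort" behavior discussed above, one option is to apply it on the Java side after the conversion. The following is only a sketch under stated assumptions, not part of jep: `StringFallback`, `normalize`, and the `Unmapped` stand-in class are hypothetical names. It keeps values that already have a plain Java form, recurses into lists and maps, and replaces anything else with its string form. The output above suggests that `toString()` on a PyObject returns the Python string form, so real PyObjects would presumably be handled by the final branch, provided the call runs on the thread that owns the interpreter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StringFallback {

    // Hypothetical stand-in for an element that was left as an opaque object.
    static final class Unmapped {
        private final String text;

        Unmapped(String text) {
            this.text = text;
        }

        @Override
        public String toString() {
            return text;
        }
    }

    // Keeps plain Java values, recurses into lists and maps, stringifies the rest.
    static Object normalize(Object value) {
        if (value == null || value instanceof String
                || value instanceof Number || value instanceof Boolean) {
            return value;
        }
        if (value instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object item : (List<?>) value) {
                out.add(normalize(item));
            }
            return out;
        }
        if (value instanceof Map) {
            Map<Object, Object> out = new LinkedHashMap<>();
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
                out.put(normalize(entry.getKey()), normalize(entry.getValue()));
            }
            return out;
        }
        // Last resort: fall back to the string form.
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        List<Object> mixed = Arrays.asList(
                1L, "a", new Unmapped("float64"),
                Arrays.asList(new Unmapped("int32"), 2.5));
        System.out.println(normalize(mixed));
    }
}
```

The result contains only Strings, Numbers, Booleans, and standard collections, so it can be serialized or handed to another thread like any other Java value.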