jpype-project / jpype

JPype is a cross-language bridge that gives Python programs full access to Java class libraries.
http://www.jpype.org
Apache License 2.0

Typecasting of string arrays #953

Open max3-2 opened 3 years ago

max3-2 commented 3 years ago

Hi,

I'm trying to typecast a string matrix to Java (a 2D array), either in byte or object mode. I'm trying to adapt my numerical value casting, which works without specifying a datatype, but I can't seem to figure this one out. Is it even possible? I'm getting either "not a primitive datatype" or "no type converter found".

MWE

# jpype typecast tester for arrays
import logging
import numpy as np
import jpype
import jpype.types as jtypes

logger = logging.getLogger(__name__)

# Start the Java virtual machine.
jpype.startJVM()

# 2-D
test_int_matrix = np.ones((10, 10), dtype=int)
test_float_matrix = np.ones((10, 10), dtype=float)
test_bool_matrix = np.ones((10, 10), dtype=bool)
test_string_matrix = np.array([
    f'string_{i}' for i in range(100)]).reshape((10, 10))
test_string_object_matrix = test_string_matrix.astype(object)

# Start the tests
passed = True

#typecast 2D
try:
    print(jtypes.JArray.of(test_int_matrix))
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False
try:
    print(jtypes.JArray.of(test_float_matrix))
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False
try:
    print(jtypes.JArray.of(test_bool_matrix))
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False
try:
    print(jtypes.JArray.of(test_string_matrix, dtype=jtypes.JChar))
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False
try:
    print(jtypes.JArray.of(test_string_object_matrix, dtype=jtypes.JObject))
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False

Thrameos commented 3 years ago

The two failing tests currently behave as intended, though it would be possible to add an enhancement to support those constructs. The JChar case fails because JChar is generally considered an encoded type, and there is a significant risk that someone would accidentally transfer unencoded data into JChar, which would lead to errors. JChar is therefore intentionally not treated as a number type.

As for the second, the conversion to JObject is not specific enough (it would need to be JString), and JString is not a primitive type. Currently JArray.of performs a bulk transfer by looking up a converter in a dictionary of predefined bulk transfer functions. Basically it maps the numpy array into memory, creates a target array of the appropriate size, and then calls the primitive converter to push over every element. We could add a hook that checks for a Python conversion function before failing. However, this would defeat a good amount of the speed optimization, as we would have to pack each element into a tuple, call the conversion function in Python, and then unpack the results.
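The dictionary-based bulk transfer described above can be illustrated with a small pure-Python sketch. This is not JPype's implementation: the `CONVERTERS` table and the `bulk_transfer` function are hypothetical names, the table covers only a handful of format characters, and a Python list stands in for the Java target array. It handles 1-D buffers only.

```python
import array

# Hypothetical dispatch table: buffer format character -> element converter.
# Illustrative only; these are not JPype internals.
CONVERTERS = {
    "b": int, "h": int, "i": int, "l": int, "q": int,
    "f": float, "d": float,
    "?": bool,
}


def bulk_transfer(buf):
    """Copy every element of a 1-D buffer through the converter for its format."""
    view = memoryview(buf)
    fmt = view.format.lstrip("@=<>!")  # drop an optional byte-order prefix
    try:
        convert = CONVERTERS[fmt]
    except KeyError:
        # No predefined converter: this is the point where the transfer fails.
        raise TypeError(f"no bulk converter for buffer format {fmt!r}") from None
    # Stand-in for "create a target array and push over every element".
    return [convert(x) for x in view.tolist()]


doubles = bulk_transfer(array.array("d", [1.0, 2.5]))
```

A format that is missing from the table, such as the fixed-width string format of a numpy `<U` array, ends up in the `TypeError` branch, which mirrors the "no type converter found" style of failure reported in the question.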

It does make some sense to support such an option, if only so that we can provide a more optimized path at a future point, but I will have to put this on the wish list for now.

max3-2 commented 3 years ago

Thanks for the quick reply. I think the object case is not so important, but creating a string matrix would be nice in some cases.

Is there any workaround right now to get a 2D array of strings (dtype starts with <U) into Java? Maybe building each row and then stacking them? Converting 1D arrays already works with JArray(JString)(npa). Or going through a bytes conversion using str.encode()?
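The str.encode() idea could look roughly like the sketch below, which packs a rectangular string matrix into a three-dimensional buffer of unsigned bytes using only the standard library. The function name `encode_string_matrix` is hypothetical, NUL padding is an arbitrary choice, and whether JArray.of accepts such a buffer has not been verified here. The Java side would still have to decode each cell back into a String.

```python
def encode_string_matrix(rows, encoding="utf-8"):
    """Pack a rectangular matrix of str into a (n_rows, n_cols, width) byte buffer."""
    encoded = [[s.encode(encoding) for s in row] for row in rows]
    if not encoded or not encoded[0]:
        raise ValueError("matrix must contain at least one element")
    n_rows, n_cols = len(encoded), len(encoded[0])
    if any(len(row) != n_cols for row in encoded):
        raise ValueError("matrix must be rectangular")
    # Every cell is padded to the longest encoded string (at least one byte).
    width = max(1, max(len(b) for row in encoded for b in row))
    packed = bytearray()
    for row in encoded:
        for b in row:
            packed += b.ljust(width, b"\0")  # pad with NUL bytes
    # Format "B" (unsigned byte) is a plain primitive buffer format.
    return memoryview(bytes(packed)).cast("B", shape=[n_rows, n_cols, width])


view = encode_string_matrix([["a", "bcd"], ["ef", "g"]])
cells = view.tolist()
# Decoding one cell again: strip the padding, then decode.
restored = bytes(cells[0][1]).rstrip(b"\0").decode("utf-8")
```

The padding makes this lossy for strings that end in NUL characters, and the copy doubles the memory footprint, so this is a workaround rather than a replacement for a native string path.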

Thrameos commented 3 years ago

I would have to look it over to see if it would be possible. To avoid requiring numpy in jpype, I can only support operations that are accessible through the Python buffer protocol. Numpy arrays are a superset of the buffer protocol, so there are a lot of bulk operations that numpy does that I can't replicate. But if the type is one that I can recognize using the buffer protocol, then we can certainly add it as an enhancement in the future.

Thrameos commented 3 years ago

I modified your example by calling memoryview(X).format to see what the internal API is seeing for these objects. The format for the string example is 9w, which is likely 9 wide chars. If I change the string length, the number changes accordingly. The string object array has format O. Both are outside the standard specification for Python buffers that we currently support.

https://docs.python.org/3/library/struct.html#format-characters
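The same inspection can be reproduced with the standard library alone. In the sketch below, `describe_buffer` and `SUPPORTED_FORMATS` are hypothetical names, and the set of format characters is an assumed subset of the struct codes, not the list JPype actually checks.

```python
import array

# Assumed subset of struct-style format characters treated as "supported".
# Illustrative only; not the list JPype actually checks.
SUPPORTED_FORMATS = set("?bBhHiIlLqQfd")


def describe_buffer(obj):
    """Return (format, shape, supported) as seen through the buffer protocol."""
    view = memoryview(obj)
    code = view.format.lstrip("@=<>!")  # drop an optional byte-order prefix
    return view.format, view.shape, code in SUPPORTED_FORMATS


print(describe_buffer(array.array("d", [1.0, 2.0])))  # ('d', (2,), True)
print(describe_buffer(b"abc"))                        # ('B', (3,), True)
```

Passing the numpy string matrix from the example to such a helper would report the 9w format mentioned above, and the object matrix would report O. Neither is a single struct format character, so both would come back as unsupported.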

That doesn't mean it isn't possible, simply that we would have to add logic to recognize those types of structures to make it work. Object is much more challenging, as it is also the type used for ragged arrays. The converter would first have to interrogate each element to see whether conversion is possible.
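To show what interrogating each element might involve, here is a minimal pure-Python sketch that validates a nested sequence before any conversion is attempted. The function name `check_string_matrix` is hypothetical, and the checks (rectangular shape, every element a `str`) are assumptions about what a string converter would need.

```python
def check_string_matrix(rows):
    """Return (n_rows, n_cols) if rows is a rectangular matrix of str.

    Raises ValueError for ragged input and TypeError for non-string elements.
    """
    rows = [list(row) for row in rows]
    if not rows:
        return (0, 0)
    width = len(rows[0])
    for i, row in enumerate(rows):
        if len(row) != width:
            # Object arrays may be ragged, so the shape cannot be trusted.
            raise ValueError(f"row {i} has length {len(row)}, expected {width}")
        for j, item in enumerate(row):
            if not isinstance(item, str):
                raise TypeError(f"element [{i}][{j}] is {type(item).__name__}, not str")
    return (len(rows), width)


shape = check_string_matrix([["a", "b"], ["c", "d"]])
```

Every element has to be visited from Python, which is the per-element cost that the bulk transfer path avoids for primitive types.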

max3-2 commented 3 years ago

A quick adaptation of the tests leaves me a bit puzzled: the same approach works on 1D arrays, using either JArray(JString) or JArray(JObject).

Is this due to memory order? Couldn't this be leveraged? For example, by disassembling the array into rows in Python, converting them, and then reassembling them into a 2D array in Java? This would create a copy, though...

max3-2 commented 3 years ago

FWIW, when memory is not an issue, the two approaches below work. However, the reference to the initial array is lost.

# JStrings work with object and numpy str dtype
try:
    res = jtypes.JString[test_string_matrix.shape]
    for i, row in enumerate(test_string_matrix):
        for j, col in enumerate(row):
            res[i][j] = col
    print(res.length)
    print(res[5].length)
    print(res[5][5])
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False
try:
    res = jtypes.JString[test_string_object_matrix.shape]
    for i, row in enumerate(test_string_object_matrix):
        for j, col in enumerate(row):
            res[i][j] = col
    print(res.length)
    print(res[5].length)
    print(res[5][5])
except Exception as e:
    logger.exception(f'Test failed with: {e}')
    passed = False