rstudio / reticulate

R Interface to Python
https://rstudio.github.io/reticulate
Apache License 2.0
1.68k stars 328 forks source link

array C to F translation with string arrays seemingly broken in python3 #707

Open t-kalinowski opened 4 years ago

t-kalinowski commented 4 years ago

This is the (I think correct) behavior in python2:

Sys.setenv(RETICULATE_PYTHON= Sys.which("python2"))

library(tensorflow)
x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x$numpy()
#>      [,1] [,2] [,3] [,4]
#> [1,] "1"  "2"  "3"  "4" 
#> [2,] "5"  "6"  "7"  "8" 
#> [3,] "9"  "10" "11" "12"

Created on 2020-01-30 by the reprex package (v0.3.0)

Sys.setenv(RETICULATE_PYTHON= Sys.which("python3"))

library(tensorflow)
x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x$numpy()
#>      [,1] [,2] [,3] [,4]
#> [1,] ?    ?    ?    ?   
#> [2,] ?    ?    ?    ?   
#> [3,] ?    ?    ?    ?

py_tensor2character <- reticulate::py_eval(
  "lambda tensor: tensor.numpy().astype(str)")

py_tensor2character(x)
#>      [,1] [,2] [,3] [,4]
#> [1,] "1"  "4"  "7"  "10"
#> [2,] "2"  "5"  "8"  "11"
#> [3,] "3"  "6"  "9"  "12"

Created on 2020-01-30 by the reprex package (v0.3.0)

Note, the order of the elements in python3 is not what I would expect.

btw, congrats on becoming a PBC!

skeydan commented 4 years ago

Thanks!! (although that was probably directed to @jjallaire :-))))

and thanks for spotting that! Probably bytestring -related because before conversion it's

> x$numpy
<bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(3, 4), dtype=string, numpy=
array([[b'1', b'2', b'3', b'4'],
       [b'5', b'6', b'7', b'8'],
       [b'9', b'10', b'11', b'12']], dtype=object)>>

trying to look at it later today or early next week!

kevinushey commented 4 years ago

Looks like @skeydan's gut was right:

library(reticulate)
library(tensorflow)

data <- list(
  list("a", "b", "c"),
  list("d", "e", "f"),
  list("g", "h", "i")
)

tf <- import("tensorflow", convert = FALSE)
t <- tf$constant(data, shape = shape(3, 3))$numpy()

py_to_r(t)
#>      [,1] [,2] [,3]
#> [1,] ?    ?    ?   
#> [2,] ?    ?    ?   
#> [3,] ?    ?    ?
py_to_r(t$astype("str"))
#>      [,1] [,2] [,3]
#> [1,] "a"  "d"  "g" 
#> [2,] "b"  "e"  "h" 
#> [3,] "c"  "f"  "i"
py_to_r(t$astype("bytes"))
#>      [,1] [,2] [,3]
#> [1,] "a"  "d"  "g" 
#> [2,] "b"  "e"  "h" 
#> [3,] "c"  "f"  "i"

Created on 2020-01-31 by the reprex package (v0.3.0)

I think the issue is independent of C-style versus Fortran-style array ordering, though. Rather, I think something about the way bytestrings are stored in these TensorFlow objects is confusing reticulate.

skeydan commented 4 years ago

The problem seems to be TF tensors' numpy method, which convert Unicode strings (https://www.tensorflow.org/tutorials/load_data/unicode) to NumPy objects:

>>> tf.constant([u"You're", u"welcome!"]).numpy().dtype
dtype('O') # object

According to this issue

https://github.com/tensorflow/tensorflow/issues/34871

this is "expected behavior" (?).

Thinking about workarounds from the user side, @kevinushey's code above would constitute one but as far as I can see, require importing tf with convert = FALSE (as I don't see how to achieve the same result on statement level here, with tf$constant involved...? But perhaps I'm missing sth)

From the reticulate side, I wonder if in addition to is_python_string

https://github.com/rstudio/reticulate/blob/9dedfb3fa043ed5f035d6a1ee162cc9ee3b1c64e/src/python.cpp#L260

we should have another function that checks if the object could be converted to unicode?

kevinushey commented 4 years ago

Do we have any indication that this causes any more substantial issues (beyond conversion from Python to R)? In other words, do models fit in TensorFlow using these objects still produce the expected predictions, and so on?

t-kalinowski commented 4 years ago

it looks like the R to python conversion works fine, and it's only on the way back that this bug shows it's head. So it'll depend on the workflow and if the preprocessing pipeline make data go between tf -> R -> tf at least once. I ran into this because in serving a trained model, I was using a tf op to process the raw probs into top-k class labels as strings. Those were showing up wrong after switching to python3, making my model look waaaaay worse than it actually is. To be clear, the model didn't change, but the model outputs as they presented in R changed just by switching to python3.

skeydan commented 4 years ago

That sounds like we may want to do sth here?

@kevinushey what do you think about my suggestion above?

kevinushey commented 4 years ago

My main worry is that this could be expensive. Unless I'm misunderstanding, we'd have to check each element of the numpy array individually to confirm that each element is indeed a "string" (or can be converted to string).

I wonder if there's some better way forward?

t-kalinowski commented 4 years ago

Would it be possible to do what you did in python2, just convert PyBytes to R strings? This seems to work right now from the R side:

as.character(py_eval("b'foo'"))
skeydan commented 4 years ago

Yes but after the call to numpy(), the array elements will be NumPy objects (see https://github.com/rstudio/reticulate/issues/707#issuecomment-581362641 above), and not easily recognized as strings.

I don't know why, but PyUnicode_Check returns false on them... In fact, although I'm not sure this would be the correct thing to do (or would need further consideration what to do on the R side, at least) I just wanted to add a check for PyBytes as you suggest ... but strangely, even though

int PyBytes_Check(PyObject *o)

has been existing since 3.5 at least, I get undefined symbol when loading Python...

Same for the macro approach which we're following here

https://github.com/rstudio/reticulate/blob/aae44af7d4e800293316c54857cf6818998a577d/src/libpython.h#L157

If I try the same for PyBytesObject I get undefined symbol as well...

Would you have an idea what could be going on there @kevinushey ?

kevinushey commented 4 years ago

To answer these questions, you usually need to look into the Python source code:

https://github.com/python/cpython/blob/58f4e1a6ee4c6ea82f3f5075d9d9d344ce6b8a56/Include/bytesobject.h#L47-L49

Because those functions are defined as macros, they cannot be loaded from the Python library. We accommodate some of these sorts of macros here:

https://github.com/rstudio/reticulate/blob/aae44af7d4e800293316c54857cf6818998a577d/src/libpython.h#L124-L142

We might be able to load and check against the PyBytes_Type, as it's defined here:

https://github.com/python/cpython/blob/58f4e1a6ee4c6ea82f3f5075d9d9d344ce6b8a56/Objects/bytesobject.c#L2807-L2848

skeydan commented 4 years ago

Oh, I see, thanks! Got it to work, but PyBytes_Check returns false (and also the NumPy type is 17 == object)...

So I guess the only remaining option would be the convertibility check - and how much this would affect performance should depend on how many cases of non-string-objects we'd have... meaning, objects that fall through all tests in

https://github.com/rstudio/reticulate/blob/9dedfb3fa043ed5f035d6a1ee162cc9ee3b1c64e/src/python.cpp#L260

(((I don't really have an idea how often that function would actually return false...)))

BTW I still think TF's numpy() should not return NumPy objects in these cases, and I don't see how it's "intended behavior", as concluded in the TF issue I linked above...

t-kalinowski commented 4 years ago

Quick update, this is still an issue. I though I could work around it by converting bytes to str in python3, but that didn't help.

library(tensorflow)

decode <- py_run_string(
"import numpy as np
_decode_ufunc = np.vectorize(lambda z: z.decode('utf-8'))
def decode(tensor):
   return _decode_ufunc(tensor.numpy())
")$decode

x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x
#> tf.Tensor(
#> [[b'1' b'2' b'3' b'4']
#>  [b'5' b'6' b'7' b'8']
#>  [b'9' b'10' b'11' b'12']], shape=(3, 4), dtype=string)
decode(x)
#>      [,1] [,2] [,3] [,4]
#> [1,] "1"  "4"  "7"  "10"
#> [2,] "2"  "5"  "8"  "11"
#> [3,] "3"  "6"  "9"  "12"

Created on 2020-04-15 by the reprex package (v0.3.0)

t-kalinowski commented 4 years ago

Another update, just found an acceptable workaround for now: use np.frompyfunc which always returns PyObjects, which then make it into reticulate with the correct ordering.


library(tensorflow)

decode <- py_run_string(
"import numpy as np
_decode_ufunc = np.frompyfunc(lambda b: b.decode('utf-8'), 1, 1)
def decode(tensor):
   return _decode_ufunc(tensor.numpy())
")$decode

x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x
#> tf.Tensor(
#> [[b'1' b'2' b'3' b'4']
#>  [b'5' b'6' b'7' b'8']
#>  [b'9' b'10' b'11' b'12']], shape=(3, 4), dtype=string)
decode(x)
#>      [,1] [,2] [,3] [,4]
#> [1,] "1"  "2"  "3"  "4" 
#> [2,] "5"  "6"  "7"  "8" 
#> [3,] "9"  "10" "11" "12"

Created on 2020-04-15 by the reprex package (v0.3.0)