Open t-kalinowski opened 4 years ago
Thanks!! (although that was probably directed to @jjallaire :-))))
and thanks for spotting that! Probably bytestring -related because before conversion it's
> x$numpy
<bound method _EagerTensorBase.numpy of <tf.Tensor: shape=(3, 4), dtype=string, numpy=
array([[b'1', b'2', b'3', b'4'],
[b'5', b'6', b'7', b'8'],
[b'9', b'10', b'11', b'12']], dtype=object)>>
trying to look at it later today or early next week!
Looks like @skeydan's gut was right:
library(reticulate)
library(tensorflow)
data <- list(
list("a", "b", "c"),
list("d", "e", "f"),
list("g", "h", "i")
)
tf <- import("tensorflow", convert = FALSE)
t <- tf$constant(data, shape = shape(3, 3))$numpy()
py_to_r(t)
#> [,1] [,2] [,3]
#> [1,] ? ? ?
#> [2,] ? ? ?
#> [3,] ? ? ?
py_to_r(t$astype("str"))
#> [,1] [,2] [,3]
#> [1,] "a" "d" "g"
#> [2,] "b" "e" "h"
#> [3,] "c" "f" "i"
py_to_r(t$astype("bytes"))
#> [,1] [,2] [,3]
#> [1,] "a" "d" "g"
#> [2,] "b" "e" "h"
#> [3,] "c" "f" "i"
Created on 2020-01-31 by the reprex package (v0.3.0)
I think the issue is independent of C-style versus Fortran-style array ordering, though. Rather, I think something about the way bytestrings are stored in these TensorFlow objects is confusing reticulate
.
The problem seems to be TF tensors' numpy
method, which convert Unicode strings (https://www.tensorflow.org/tutorials/load_data/unicode) to NumPy object
s:
>>> tf.constant([u"You're", u"welcome!"]).numpy().dtype
dtype('O') # object
According to this issue
https://github.com/tensorflow/tensorflow/issues/34871
this is "expected behavior" (?).
Thinking about workarounds from the user side, @kevinushey's code above would constitute one but as far as I can see, require importing tf
with convert = FALSE
(as I don't see how to achieve the same result on statement level here, with tf$constant
involved...? But perhaps I'm missing sth)
From the reticulate
side, I wonder if in addition to is_python_string
we should have another function that checks if the object could be converted to unicode?
Do we have any indication that this causes any more substantial issues (beyond conversion from Python to R)? In other words, do models fit in TensorFlow using these objects still produce the expected predictions, and so on?
it looks like the R to python conversion works fine, and it's only on the way back that this bug shows it's head. So it'll depend on the workflow and if the preprocessing pipeline make data go between tf -> R -> tf at least once. I ran into this because in serving a trained model, I was using a tf op to process the raw probs into top-k class labels as strings. Those were showing up wrong after switching to python3, making my model look waaaaay worse than it actually is. To be clear, the model didn't change, but the model outputs as they presented in R changed just by switching to python3.
That sounds like we may want to do sth here?
@kevinushey what do you think about my suggestion above?
My main worry is that this could be expensive. Unless I'm misunderstanding, we'd have to check each element of the numpy array individually to confirm that each element is indeed a "string" (or can be converted to string).
I wonder if there's some better way forward?
Would it be possible to do what you did in python2, just convert PyBytes
to R strings?
This seems to work right now from the R side:
as.character(py_eval("b'foo'"))
Yes but after the call to numpy()
, the array elements will be NumPy object
s (see https://github.com/rstudio/reticulate/issues/707#issuecomment-581362641 above), and not easily recognized as strings.
I don't know why, but PyUnicode_Check
returns false on them... In fact, although I'm not sure this would be the correct thing to do (or would need further consideration what to do on the R side, at least) I just wanted to add a check for PyBytes
as you suggest ... but strangely, even though
int PyBytes_Check(PyObject *o)
has been existing since 3.5 at least, I get undefined symbol
when loading Python...
Same for the macro approach which we're following here
If I try the same for PyBytesObject
I get undefined symbol as well...
Would you have an idea what could be going on there @kevinushey ?
To answer these questions, you usually need to look into the Python source code:
Because those functions are defined as macros, they cannot be loaded from the Python library. We accommodate some of these sorts of macros here:
We might be able to load and check against the PyBytes_Type
, as it's defined here:
Oh, I see, thanks!
Got it to work, but PyBytes_Check
returns false (and also the NumPy type is 17 == object)...
So I guess the only remaining option would be the convertibility check - and how much this would affect performance should depend on how many cases of non-string-objects we'd have... meaning, objects that fall through all tests in
(((I don't really have an idea how often that function would actually return false...)))
BTW I still think TF's numpy()
should not return NumPy object
s in these cases, and I don't see how it's "intended behavior", as concluded in the TF issue I linked above...
Quick update, this is still an issue. I though I could work around it by converting bytes to str
in python3, but that didn't help.
library(tensorflow)
decode <- py_run_string(
"import numpy as np
_decode_ufunc = np.vectorize(lambda z: z.decode('utf-8'))
def decode(tensor):
return _decode_ufunc(tensor.numpy())
")$decode
x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x
#> tf.Tensor(
#> [[b'1' b'2' b'3' b'4']
#> [b'5' b'6' b'7' b'8']
#> [b'9' b'10' b'11' b'12']], shape=(3, 4), dtype=string)
decode(x)
#> [,1] [,2] [,3] [,4]
#> [1,] "1" "4" "7" "10"
#> [2,] "2" "5" "8" "11"
#> [3,] "3" "6" "9" "12"
Created on 2020-04-15 by the reprex package (v0.3.0)
Another update, just found an acceptable workaround for now:
use np.frompyfunc
which always returns PyObjects, which then make it into reticulate with the correct ordering.
library(tensorflow)
decode <- py_run_string(
"import numpy as np
_decode_ufunc = np.frompyfunc(lambda b: b.decode('utf-8'), 1, 1)
def decode(tensor):
return _decode_ufunc(tensor.numpy())
")$decode
x <- tf$constant(as.character(1:12), shape = shape(3, 4))
x
#> tf.Tensor(
#> [[b'1' b'2' b'3' b'4']
#> [b'5' b'6' b'7' b'8']
#> [b'9' b'10' b'11' b'12']], shape=(3, 4), dtype=string)
decode(x)
#> [,1] [,2] [,3] [,4]
#> [1,] "1" "2" "3" "4"
#> [2,] "5" "6" "7" "8"
#> [3,] "9" "10" "11" "12"
Created on 2020-04-15 by the reprex package (v0.3.0)
This is the (I think correct) behavior in python2:
Created on 2020-01-30 by the reprex package (v0.3.0)
Created on 2020-01-30 by the reprex package (v0.3.0)
Note, the order of the elements in python3 is not what I would expect.
btw, congrats on becoming a PBC!