larray-project / larray

N-dimensional labelled arrays in Python
https://larray.readthedocs.io/
GNU General Public License v3.0
8 stars 6 forks source link

Array.apply does not detect dtype correctly #989

Open gdementen opened 2 years ago

gdementen commented 2 years ago

I tried this as a workaround to #988:

>>> arr = full("a=a0..a2", None)
>>> arr.dtype
dtype('O')
>>> arr['a1'] = 'value'
>>> arr == None
False
>>> arr.apply(lambda v: 0 if v is None else v)
ValueError: invalid literal for int() with base 10: 'value'
>>> arr.apply(lambda v: 0 if v is None else v, dtype=object)
a  a0     a1  a2
    0  value   0
alixdamman commented 2 years ago

The title is little bit vague and the example given in the description sounds like very specific.

When you say that Array.apply does not detect dtype correctly, you mean globally or in delimited specifics cases ?

gdementen commented 2 years ago

The title is little bit vague and the example given in the description sounds like very specific.

When you say that Array.apply does not detect dtype correctly, you mean globally or in delimited specifics cases ?

I did not know. The test case above failed and I did not investigate at the time. Now I just checked and the "culprit" is np.vectorize code path (which is only called if the by argument is not used) uses the dtype for the first element as the dtype for the whole array (unless explicitly provided), which is not correct for my test case.

The data type of the output of vectorized is determined by calling the function with the first element of the input.

Adding to the confusion, the dtype argument of our own apply function is documented as:

dtype : type or list of types, optional Output(s) data type(s). Defaults to None (inspect all output values to infer it automatically).

Either way, we should have the same behavior by default for both codepaths. There are several ways to fix this though:

  1. inspect all values to determine the dtype in the "non-by" codepath.
  2. use first element dtype by default in the "by" codepath (and update dtype docstring accordingly).
  3. implement both kinds of dtype detection (dtype="auto" and dtype="first") for both code paths. There is still the question of which one to use by default. I think I would favor "auto" because I think it is better to be correct by default than fast.

I would prefer option 3, but option 1 could be an intermediate step in the right direction.