scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License

Error from astype() on StringArray and inconsistencies with zeros_like() #199

Open masonproffitt opened 4 years ago

masonproffitt commented 4 years ago

My use case: I need to be able to make a mask for a JaggedArray containing strings, starting with something like this:

jagged_array_of_strings.zeros_like().astype(bool)

but this fails on a couple of different levels. The first is that StringArray seems to have a problem with astype():

>>> j = awkward.fromiter(['True'])
>>> j
<StringArray ['True'] at 0x7f6e88799400>
>>> j.astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 111, in __repr__
    return "<{0} {1} at 0x{2:012x}>".format(self.__class__.__name__, str(self), id(self))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 98, in __str__
    return "[{0}]".format(" ".join(self._util_arraystr(x) for x in self.__iter__(checkiter=False)))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/base.py", line 98, in <genexpr>
    return "[{0}]".format(" ".join(self._util_arraystr(x) for x in self.__iter__(checkiter=False)))
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/objects.py", line 177, in __iter__
    for x in self._content:
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 496, in __iter__
    self._valid()
  File "/home/mproffit/anaconda3/lib/python3.7/site-packages/awkward/array/jagged.py", line 466, in _valid
    raise ValueError("maximum offset {0} is beyond the length of the content ({1})".format(self._offsets.max(), len(self._content)))
ValueError: maximum offset 4 is beyond the length of the content (1)

Independently, zeros_like() has some problematic behavior on StringArray as well:

>>> j.zeros_like()
<StringArray ['\x00\x00\x00\x00'] at 0x7f6e887990f0>

My issue with this is that a string of null bytes actually evaluates to True and can't even be directly converted to a number:

>>> bool('\x00')
True
>>> int('\x00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\x00'

For comparison, numpy's zeros_like() converts strings to empty strings:

>>> import numpy as np
>>> a = np.array('True')
>>> a
array('True', dtype='<U4')
>>> np.zeros_like(a)
array('', dtype='<U4')

Empty strings do convert to False (i.e., bool('') is False).
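
As a minimal sketch of that rule on a flat NumPy string array (the sample values are made up for illustration, and this uses plain NumPy rather than anything in awkward), comparing against the empty string gives the same answer as Python's bool() on each element:

```python
import numpy as np

# Flat NumPy string array; the values are illustrative only.
words = np.array(["True", "", "False"])

# Python's rule: a string is truthy exactly when it is non-empty.
python_truth = [bool(w) for w in words.tolist()]

# The same rule, vectorized: compare each element against the empty string.
nonempty = words != ""

# Equivalent formulation through per-element string lengths.
by_length = np.char.str_len(words) > 0
```

Either vectorized form avoids astype(bool), which raises on string arrays as shown below.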

As an aside, astype(bool) oddly doesn't actually work on this ndarray:

>>> np.zeros_like(a).astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: ''

But the following does work (and unfortunately doesn't have an equivalent in awkward as far as I'm aware):

>>> np.zeros_like(a, dtype=bool)
array(False)

Edit: Turns out this known problem in numpy has been sitting around for a couple years: https://github.com/numpy/numpy/issues/9875

masonproffitt commented 4 years ago

I suppose I can use boolmask() as a workaround for my original purpose, although it's not exactly the behavior I'm looking for. The issues above are still valid, though.

jpivarski commented 4 years ago

Converting strings that say "True" and "False" into booleans is a little beyond what astype is supposed to do, even in Numpy:

>>> numpy.array(["True", "False", "True", "True", "False"]).astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'True'
>>> numpy.array(["true", "false", "true", "true", "false"]).astype(bool)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'true'

In Awkward, a StringArray is a JaggedArray of uint8 with special methods to interpret the bytes as strings. This makes it different from a Numpy string array (which isn't jagged). What exactly that should mean for astype, I'm not sure. I'll have to think about it. But anyway, what you're looking for here goes against the grain of the API, and we should think more about a better design (at least, better error messages: Numpy has the decency to give you an error message, and perhaps Awkward should, too).
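
To make that layout concrete, here is a rough sketch in plain NumPy. It is not the awkward implementation: the names content, offsets, and decode are assumptions chosen for illustration. It shows why zeroing the byte buffer produces strings of null bytes with their original lengths, and why the per-string lengths are available from the offsets alone:

```python
import numpy as np

strings = ["True", "", "False"]

# Assumed layout, for illustration: one flat uint8 buffer plus offsets
# marking where each string starts and stops.
encoded = [s.encode("utf-8") for s in strings]
content = np.frombuffer(b"".join(encoded), dtype=np.uint8)
offsets = np.cumsum([0] + [len(e) for e in encoded])

def decode(content, offsets):
    """Rebuild Python strings from the flat buffer and offsets (hypothetical helper)."""
    raw = content.tobytes()
    return [raw[start:stop].decode("utf-8")
            for start, stop in zip(offsets[:-1], offsets[1:])]

# Zeroing the content keeps the offsets, so every string keeps its length
# and becomes a run of null bytes instead of an empty string.
zeroed = decode(np.zeros_like(content), offsets)

# The per-string lengths come from the offsets alone.
counts = np.diff(offsets)
```

Under this picture, any operation that changes the length of the content without updating the offsets leaves the two out of sync, which is consistent with the "maximum offset ... is beyond the length of the content" error above.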

masonproffitt commented 4 years ago

Right, the numpy thing is a bug, which is mentioned in the issue that I linked. I think the proper thing to do is to handle conversion the same way that Python does natively. bool('') is False; bool('True'), bool('False') are both True, which I assume is the case for any non-empty string.

The specific astype() issue is an odd case and I don't see it as a high priority, but I do think it's important to have something like zeros_like(dtype=bool)--or even better, an awkward version of numpy.full_like(), which I don't think exists (?).

jpivarski commented 4 years ago

I don't think there's even a full_like in Numpy, although I can see its usefulness.

(Actually, I only implemented AwkwardArray.empty_like, zeros_like, and ones_like because I needed them internally and thought they might have some public utility as a secondary thing. Therefore, I haven't fully thought through their interface, particularly since these functions are often a prelude to __setitem__ in Numpy, and that's not allowed or very limited in awkward.)

As for returning True if and only if the string is non-empty, what about this?

>>> a = awkward.fromiter(["True", "", "False", "", "", "True"])
>>> a
<StringArray ['True' '' 'False' '' '' 'True'] at 0x7f8cc3ce98d0>
>>> a.counts
array([4, 0, 5, 0, 0, 4])
>>> a.counts.astype(bool)
array([ True, False,  True, False, False,  True])

masonproffitt commented 4 years ago

Well, what I originally wanted was an array with the same shape as the string array but every entry filled with False. Your example works for a flat array of strings but doesn't work for a JaggedArray. What I just found that does work is .localindex.zeros_like().astype(bool), but that's not intuitive and a lot of typing...

(numpy.full_like() does in fact exist: https://docs.scipy.org/doc/numpy/reference/generated/numpy.full_like.html)
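
For reference, a short sketch of what the NumPy versions do on a flat string array (sample values are illustrative); the dtype argument overrides the string dtype of the input, so only the shape is borrowed:

```python
import numpy as np

a = np.array(["True", "", "False"])

# Same shape as `a`, but boolean and filled with a chosen value.
all_false = np.full_like(a, False, dtype=bool)
all_true = np.full_like(a, True, dtype=bool)

# zeros_like with an explicit dtype gives the same all-False result.
also_false = np.zeros_like(a, dtype=bool)
```

An awkward equivalent would additionally have to preserve the jagged structure, which these flat-array functions do not address.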

jpivarski commented 4 years ago

Right, and your example (with localindex) only works for exactly one level of jaggedness.
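
As a rough illustration of the desired behavior at any depth, here is a sketch on plain nested Python lists. The helper false_like is hypothetical and is not part of awkward or NumPy; it only shows the intended shape-preserving result, not an efficient columnar implementation:

```python
def false_like(nested):
    """Return the same nesting with every string leaf replaced by False.

    Hypothetical helper operating on nested Python lists of strings.
    """
    if isinstance(nested, str):
        return False
    return [false_like(item) for item in nested]

# One level of jaggedness.
jagged_strings = [["a", "bc"], [], ["def"]]
mask = false_like(jagged_strings)

# Two levels of jaggedness go through the same code path.
deeper = [[["a"], []], [["bc", "d"]]]
deep_mask = false_like(deeper)
```

Here mask is [[False, False], [], [False]] and deep_mask is [[[False], []], [[False, False]]], so the nesting is preserved regardless of depth.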