Open masonproffitt opened 5 years ago
I suppose I can use boolmask() as a workaround for my original purpose, although it's not exactly the behavior I'm looking for. The issues above are still valid, though.
Converting strings that say "True" and "False" into booleans is a little beyond what astype is supposed to do, even in Numpy:
>>> numpy.array(["True", "False", "True", "True", "False"]).astype(bool)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'True'
>>> numpy.array(["true", "false", "true", "true", "false"]).astype(bool)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'true'
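If the goal is to parse the literal text instead of relying on astype, an explicit elementwise comparison is one option. This is a minimal sketch, assuming every entry is exactly "True" or "False"; the variable names are illustrative only:

```python
import numpy

strings = numpy.array(["True", "False", "True", "True", "False"])

# Compare each entry against the literal "True"; every other string maps to False.
parsed = strings == "True"
```

The result is a boolean ndarray with the same shape as the input, so it can be used directly as a mask.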
In Awkward, a StringArray is a JaggedArray of uint8 with special methods to interpret them as strings. This makes it different from a Numpy string array (which isn't jagged). What exactly that should mean for astype, I'm not sure. I'll have to think about it. But anyway, what you're looking for here goes against the grain of the API, and we should think more about a better design (or at least better error messages: Numpy has the decency to give you an error message, and perhaps Awkward should, too. I'll have to think about that).
Right, the numpy thing is a bug, which is mentioned in the issue that I linked. I think the proper thing to do is to handle conversion the same way that Python does natively. bool('') is False; bool('True') and bool('False') are both True, which I assume is the case for any non-empty string.
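Python's native rule can be checked directly: only the empty string is falsy, regardless of what a non-empty string says. A small sketch with a few illustrative sample strings:

```python
samples = ["", "True", "False", "0", " "]

# bool() on a str looks only at its length, never at its content.
truthiness = [bool(s) for s in samples]
```

Here only the first entry (the empty string) converts to False; "False", "0", and a single space all convert to True.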
The specific astype() issue is an odd case and I don't see it as a high priority, but I do think it's important to have something like zeros_like(dtype=bool), or even better, an awkward version of numpy.full_like(), which I don't think exists (?).
I don't think there's even a full_like in Numpy, although I can see its usefulness.
(Actually, I only implemented AwkwardArray.empty_like, zeros_like, and ones_like because I needed them internally and thought they might have some public utility as a secondary thing. Therefore, I haven't fully thought through their interface, particularly since these functions are often a prelude to __setitem__ in Numpy, and that's not allowed or very limited in awkward.)
As for returning True if and only if the string is non-empty, what about this?
>>> a = awkward.fromiter(["True", "", "False", "", "", "True"])
>>> a
<StringArray ['True' '' 'False' '' '' 'True'] at 0x7f8cc3ce98d0>
>>> a.counts
array([4, 0, 5, 0, 0, 4])
>>> a.counts.astype(bool)
array([ True, False, True, False, False, True])
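For comparison, the same "non-empty means True" idea can be sketched for a flat Numpy string array (not an Awkward StringArray) by going through the string lengths. This assumes numpy.char.str_len is an acceptable stand-in for counts:

```python
import numpy

strings = numpy.array(["True", "", "False", "", "", "True"])

# Length of each string, analogous to the counts of a StringArray.
lengths = numpy.char.str_len(strings)

# A zero length becomes False; any positive length becomes True.
nonempty = lengths.astype(bool)
```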
Well, what I originally wanted was an array with the same shape as the string array but every entry filled with False. Your example works for a flat array of strings but doesn't work for a JaggedArray. What I just found that does work is .localindex.zeros_like().astype(bool), but that's not intuitive and a lot of typing...
(numpy.full_like() does in fact exist: https://docs.scipy.org/doc/numpy/reference/generated/numpy.full_like.html)
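On the Numpy side, full_like with an explicit dtype gives an all-False array shaped like a string array without converting the strings themselves. A minimal sketch for a flat (non-jagged) ndarray; it does not cover Awkward arrays:

```python
import numpy

strings = numpy.array(["True", "", "False"])

# Only the shape is taken from `strings`; the dtype is overridden to bool,
# so no string-to-bool conversion is attempted.
mask = numpy.full_like(strings, False, dtype=bool)
```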
Right, and your example (with localindex) only works for exactly one level of jaggedness.
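To illustrate what "any level of jaggedness" would require, here is a pure-Python sketch over nested lists. The helper false_like is hypothetical and not part of awkward or Numpy; it only shows the recursive shape-preserving idea, not an efficient implementation:

```python
def false_like(nested):
    """Return a structure shaped like `nested` with every leaf replaced by False."""
    if isinstance(nested, list):
        # Recurse into each sublist so arbitrary nesting depth is preserved.
        return [false_like(item) for item in nested]
    # Anything that is not a list (e.g. a string) is treated as a leaf.
    return False

strings = [["a", "bc"], [], [["d"], ["e", "f"]]]
mask = false_like(strings)
```

Empty sublists stay empty, and strings at any depth become False.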
My use case: I need to be able to make a mask for a JaggedArray containing strings, starting with something like this:
but this fails on a couple different levels. The first is that StringArray seems to have a problem with astype():
Independently, zeros_like() has some problematic behavior on StringArray as well:
My issue with this is that a string of null bytes actually evaluates to True and can't even be directly converted to a number:
For comparison, numpy's zeros_like() converts strings to empty strings:
Empty strings do convert to False (i.e., bool('') is False).
As an aside, astype(bool) oddly doesn't actually work on this ndarray:
But the following does work (and unfortunately doesn't have an equivalent in awkward as far as I'm aware):
Edit: Turns out this known problem in numpy has been sitting around for a couple years: https://github.com/numpy/numpy/issues/9875