scott-griffiths / bitstring

A Python module to help you manage your bits
https://bitstring.readthedocs.io/en/stable/index.html
MIT License

Create a cast method for Arrays? #291

Open scott-griffiths opened 11 months ago

scott-griffiths commented 11 months ago

Changing the dtype of an Array just changes the interpretation of the underlying data. This is fine, and it is an O(1) operation, which fits with changing a property, but some users might want or expect it to recast the data to the new type.

To cast to a new dtype you need to do this:

from bitstring import Array

a = Array('u8', [1, 2, 3, 4, 5, 6, 7, 8])

b = Array('float64', a.tolist())

which is OK, and explicit, but adding a new method could make it clearer and give more options:

b = a.cast('float64')
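To make the distinction concrete, here is a small sketch that uses only the standard library struct module rather than bitstring itself, so the variable names and byte order are assumptions chosen for illustration. Reinterpreting keeps the bytes and changes how they are read; casting keeps the values and changes the bytes.

```python
import struct

values = [1, 2, 3, 4, 5, 6, 7, 8]

# Eight unsigned 8-bit integers packed into 8 bytes.
raw = struct.pack('>8B', *values)

# Reinterpretation: the same 8 bytes read as one big-endian float64.
# No data is converted, so the result has nothing to do with 1..8.
reinterpreted = struct.unpack('>d', raw)

# Cast: each value is converted, giving eight float64s (64 bytes).
cast_raw = struct.pack('>8d', *(float(v) for v in values))
cast = list(struct.unpack('>8d', cast_raw))

print(len(raw), len(reinterpreted))  # 8 bytes in, one float out
print(len(cast_raw), cast)           # 64 bytes, values preserved
```

The second form is what the proposed method would do, which is why it cannot be an O(1) property change.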

I don't think it's good to do it in place - there's no performance gain. We can now also deal with things like overflows better:

c = b.cast('u16', clip=True)

so the user can choose whether to get a ValueError or to clip values or whatever (divide by zero would be another one).

scott-griffiths commented 11 months ago

Probably should be called astype to copy numpy.

The numpy method has a casting parameter which can be one of:

‘no’ means the data types should not be cast at all. [Not sure what the point of this option is!]
‘equiv’ means only byte-order changes are allowed. [Reasonable I guess]
‘safe’ means only casts which can preserve values are allowed. [Only widening casts or unsigned to signed?]
‘same_kind’ means only safe casts or casts within a kind, like float64 to float32, are allowed.
‘unsafe’ means any data conversions may be done.

From experimentation, if it doesn't have room to store the full value it simply truncates the binary representation, so for example an int of 2000 becomes a uint8 of 208, which is not exactly obvious or helpful (but admittedly will be fast!)
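The 2000 to 208 result can be reproduced without numpy by keeping only the low 8 bits of the value. This is plain integer arithmetic, shown only to illustrate where the number comes from:

```python
value = 2000

# Keeping only the lowest 8 bits is the same as taking the value modulo 256.
low_byte = value & 0xFF

print(bin(value))     # 0b11111010000
print(bin(low_byte))  # 0b11010000
print(low_byte)       # 208
```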

If you ask for safe casting it just exits with a TypeError.

Maybe our options should be:

clip - values that are too large get clipped to the nearest representable value.
safe - If values can't be preserved a ValueError is raised (but it still tries).

The others are more checks on the dtypes, rather than the data, which the user can easily do themselves. If there are two options that boils down to a flag:

clip: If True out of range values are clipped to the nearest representable value, otherwise a ValueError will be raised. Defaults to False.
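A minimal sketch of how such a flag could behave, written as a standalone function over plain lists. The name cast_to_uint and its signature are hypothetical and are not part of bitstring; it only handles unsigned integer targets.

```python
def cast_to_uint(values, bits, clip=False):
    """Hypothetical helper: convert values to unsigned ints of the given bit width.

    With clip=False an out-of-range value raises ValueError;
    with clip=True it is moved to the nearest representable value.
    """
    lo, hi = 0, (1 << bits) - 1
    result = []
    for v in values:
        v = int(v)
        if not lo <= v <= hi:
            if not clip:
                raise ValueError(f"{v} is out of range for a {bits}-bit unsigned integer")
            v = min(max(v, lo), hi)
        result.append(v)
    return result


print(cast_to_uint([1, 300, -5], 8, clip=True))  # [1, 255, 0]
```

Note that the check is on the data rather than on the dtypes, so a narrowing cast still succeeds whenever every value happens to fit.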

Which is back to where we started.

scott-griffiths commented 11 months ago

It might be cool to allow the clip to happen as a function call. This would allow it to be used more widely, for example when performing other ops on Arrays. Right now it's hard to add a flag to a y = x*5 command, and y = Array.multiply(x, 5, clip=True) is pretty ugly. Not sure how it actually works in practice though.

a = b*1000   # Throws a ValueError
a = clip(b*1000)    # Magically doesn't and clips instead. Somehow.

Perhaps better would be (b*1000).clip(), but it's not obvious how it can be implemented.

If we could, the astype would be just c = b.astype('u8').clip()

scott-griffiths commented 11 months ago

with Array.Clipping:
    a = b*1000

is perhaps more obvious and easier to actually code.

scott-griffiths commented 11 months ago

astype method added in 4.1.2. No alternative casting methods yet, so leaving this open.