scikit-hep / vector

Vector classes and utilities
https://vector.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Single component value cannot be changed in VectorNumpy3D #158

Open cansik opened 2 years ago

cansik commented 2 years ago

I am using the vector library in computer-vision software to store prediction results. I have a two-step workflow: first I calculate some values, then I refine some of them (for example, the depth component z). After implementing everything, I never got the right results and wondered why.

I noticed that it is possible to set a single component on a vector object:

import vector

v = vector.obj(x=1.1, y=2.2)
v.x = 25

print(v)
# vector.obj(x=25, y=2.2)

But this is not possible as soon as I work with VectorNumpy3D (and 2D, 4D):

v = vector.array({
    "x": [1.0, 2.0],
    "y": [1.1, 2.2],
    "z": [0.1, 0.2],
})

v[1].x = 25

print(v)
# [(1., 1.1, 0.1) (2., 2.2, 0.2)]

Either there should be an error message saying that single component values cannot be set in VectorNumpy3D arrays, or the value should actually be set. At the moment, the assignment appears to be silently ignored.

Is there a way to change a single value inside of this array structure?

jpivarski commented 2 years ago

Hmm. It's hard to see how there can be a way to do what you want (either an error message or a change in the original array). When you say v[1], you get a new VectorObject3D, and you can set the x value of that new object to 25, then that new object goes out of scope. Pandas has this problem too, and they raise a SettingWithCopyWarning (which is complained about far more than any other Pandas warning).

If, instead of

v[1].x = 25

you had said

v.x[1] = 25

it would have assigned to the original array: v.x makes a NumPy array that shares memory with the vector array, so when you assign to that, you get the change in the original.
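The same view-versus-copy distinction can be reproduced with a plain NumPy structured array. This is only a rough stand-in for the vector array (an assumption for illustration, not vector's actual internals), and the dictionary `element` is a hypothetical substitute for the new object that `v[1]` produces:

```python
import numpy as np

# Stand-in for the vector array: a structured array with x, y, z fields.
arr = np.zeros(2, dtype=[("x", float), ("y", float), ("z", float)])
arr["x"] = [1.0, 2.0]
arr["y"] = [1.1, 2.2]
arr["z"] = [0.1, 0.2]

# Analogous to v[1].x = 25: the values of element 1 are copied into a
# separate Python object, so writing to that object leaves arr untouched.
element = {name: float(arr[1][name]) for name in arr.dtype.names}
element["x"] = 25.0
x_after_copy_write = float(arr["x"][1])  # still the original value

# Analogous to v.x[1] = 25: selecting the field first gives a view that
# shares memory with arr, so the write reaches the original array.
x_view = arr["x"]
shares = np.shares_memory(x_view, arr)
x_view[1] = 25.0
x_after_view_write = float(arr["x"][1])
```

Selecting the component before the index is what keeps the assignment connected to the original buffer.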

Mutability and views vs copies have always been murky, which is why there's such a strong movement behind functional programming.

Anyway, to actually implement this, we would need a new parameter to VectorObject*Ds or a new subclass that links it to the original array, so that we know how to propagate changes back. However, that would have the surprising consequence that extracting some elements from an array and deleting the array won't actually free any memory because each extracted element holds a reference to the whole array. VectorObject*Ds that come from Awkward Arrays can't have this feature because Awkward Arrays are immutable (the view-vs-copy issue is even more complex for them). Maybe instead, there should be a parameter or new type of VectorObject*D that is not connected to the original array but has a "do not write" flag, so that at least you get an error message, whether the VectorObject*D is derived from a NumPy array or Awkward Array. However, that prohibition against writing is merely formal: some users would want to write to these objects and they'd need a way to override the "do not write" flag, which would be another flag.
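The memory concern can be seen with plain NumPy views, which already behave this way. The following is a small sketch using a NumPy array as a stand-in (an assumption for illustration); a linked VectorObject would presumably hold a similar reference to its parent:

```python
import numpy as np

big = np.zeros(1_000_000)   # about 8 MB of float64 data
one_element = big[10:11]    # basic slicing returns a view, not a copy

# The view itself describes only 8 bytes...
view_bytes = one_element.nbytes

# ...but it keeps a reference to the whole parent array through .base,
# so deleting the name `big` does not release the parent's buffer.
keeps_parent = one_element.base is big
del big
parent_bytes_still_alive = one_element.base.nbytes
```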

At the end of that brainstorming, the "do not write" flag sounds like it would cause the fewest problems, and it would present an error message explaining all of these issues with a way to override the flag. I suppose that would be a writable: bool property on VectorObject*D. What do you think?
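A minimal sketch of how such a flag might behave, in pure Python. The class name `ElementSketch` and the error text are hypothetical; this is not an existing vector API, only an illustration of the proposal:

```python
class ElementSketch:
    """Hypothetical stand-in for a VectorObject3D extracted from an array."""

    def __init__(self, x, y, z, writable=False):
        # Use object.__setattr__ so initialization bypasses the guard below.
        object.__setattr__(self, "writable", writable)
        object.__setattr__(self, "x", x)
        object.__setattr__(self, "y", y)
        object.__setattr__(self, "z", z)

    def __setattr__(self, name, value):
        # The flag itself can always be changed; that is the override.
        if name != "writable" and not self.writable:
            raise AttributeError(
                f"cannot set {name!r}: this object is a copy of an array "
                "element, so the change would not reach the array; assign "
                "through the array component instead, or set writable=True "
                "to modify the copy"
            )
        object.__setattr__(self, name, value)


element = ElementSketch(2.0, 2.2, 0.2)

try:
    element.x = 25          # rejected while writable is False
    blocked = False
except AttributeError:
    blocked = True

element.writable = True     # explicit override
element.x = 25              # now modifies the copy only
```

With this design the default case fails loudly, and users who really want to mutate the detached copy have to opt in.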

(I'm going to label this as a feature request.)

cansik commented 2 years ago

@jpivarski Thank you very much for the comprehensive explanation. I guess for now it's fine to access the original array with v.x[1] = 25.