scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License
829 stars 85 forks source link

feat: copy_behaviors to make sub-classing easy #3137

Closed Saransh-cpp closed 3 months ago

Saransh-cpp commented 4 months ago

XRef #2433

lgray commented 3 months ago

Any update on the tests for this PR?

Saransh-cpp commented 3 months ago

Hi, this already has tests! I think I forgot to request a review.

Saransh-cpp commented 3 months ago

Hi @jpivarski, thanks for the review and fixing the PR! Sorry for taking too long to respond here. The MWE below shows why I added the if condition in this PR -

import awkward as ak
import numpy

class SuperVector:
    def add(self, other):
        return ak.zip(
            {"x": self.x + other.x, "y": self.y + other.y},
            with_name="VectorTwoD",
            behavior=self.behavior,
        )

@ak.mixin_class(ak.behavior)
class VectorTwoD(SuperVector):
    def __eq__(self, other):
        return ak.all(self.x == other.x) and ak.all(self.y == other.y)

ak.behavior[numpy.add, "VectorTwoD", "VectorTwoD"] = lambda v1, v2: v1.add(v2)
print(ak.behavior)

@ak.mixin_class(ak.behavior)
class VectorTwoDAgain(VectorTwoD):
    pass

print(ak.behavior) 
# ('*', 'VectorTwoDAgain'): <class '__main__.VectorTwoDAgainArray'>
# 'VectorTwoDAgain': <class '__main__.VectorTwoDAgainRecord'>

ak.behavior.update(
    ak._util.copy_behaviors(VectorTwoD, VectorTwoDAgain, ak.behavior)
)
print(ak.behavior) 
# ('*', 'VectorTwoDAgain'): <class '__main__.VectorTwoDArray'> -> removing the star condition (the class is wrong)
# ('V', 'e', 'c', 't', 'o', 'r', 'T', 'w', 'o', 'D', 'A', 'g', 'a', 'i', 'n'): <class '__main__.VectorTwoDAgainRecord'> -> removing the string condition

Additionally, replacing the generic string condition with a more specific case -

if oldname != key:

leads to -

{'VectorTwoD': <class '__main__.VectorTwoDRecord'>, ('*', 'VectorTwoD'): <class '__main__.VectorTwoDArray'>, (<ufunc 'add'>, 'VectorTwoD', 'VectorTwoD'): <function <lambda> at 0x1050a7ba0>, 'VectorTwoDAgain': <class '__main__.VectorTwoDAgainRecord'>, ('*', 'VectorTwoDAgain'): <class '__main__.VectorTwoDAgainArray'>, (<ufunc 'add'>, 'VectorTwoDAgain', 'VectorTwoDAgain'): <function <lambda> at 0x1050a7ba0>, ('V', 'e', 'c', 't', 'o', 'r', 'T', 'w', 'o', 'D', 'A', 'g', 'a', 'i', 'n'): <class '__main__.VectorTwoDAgainRecord'>}

Could you please let me know what should be done here? I tested the current implementation with Coffea and everything works as expected. (I pointed out that there is a bug in copy_behaviors on Slack but that was a false alarm.)

jpivarski commented 3 months ago
('V', 'e', 'c', 't', 'o', 'r', 'T', 'w', 'o', 'D', 'A', 'g', 'a', 'i', 'n')

You definitely want to check for the isinstance(key, str) case and not iterate over the string as though it were a tuple.

I was assuming that you'd copy all of the behaviors, including the "VectorTwoD" and the ("*", "VectorTwoD") ones, and then replace the ones that are different for the new class, which may be NumPy overloads, Numba overloads, or whole class overloads. That's why I was thinking you'd want this function:

def copy_behaviors(existing_class: typing.Any, new_class: typing.Any, behavior: dict):
    output = {}

    oldname = existing_class.__name__
    newname = new_class.__name__

    for key, value in behavior.items():
        if isinstance(key, str):
            if key == oldname:
                output[newname] = value
        else:
            if oldname in key:
                new_tuple = tuple(newname if k == oldname else k for k in key)
                output[new_tuple] = value

    return output

and then define the new class after having copied the behaviors:

class VectorTwoD(SuperVector):
    def __eq__(self, other):
        return ak.all(self.x == other.x) and ak.all(self.y == other.y)

ak.behavior["VectorTwoD"] = VectorTwoD
ak.behavior["*", "VectorTwoD"] = VectorTwoD
ak.behavior[numpy.add, "VectorTwoD", "VectorTwoD"] = lambda v1, v2: v1.add(v2)

ak.behavior.update(ak._util.copy_behaviors(VectorTwoD, VectorTwoDAgain, ak.behavior))

class VectorTwoDAgain(VectorTwoD):
    pass

ak.behavior["VectorTwoDAgain"] = VectorTwoDAgain
ak.behavior["*", "VectorTwoDAgain"] = VectorTwoDAgain
ak.behavior[numpy.add, "VectorTwoDAgain", "VectorTwoDAgain"] = lambda v1, v2: v1.add(v2)

But now I see the problem. You were using the @mixin_class decorator to add both Array and Record forms of the class to ak.behavior, which is equivalent to assigning them directly into the ak.behavior dict, as I did above, but since it's a decorator, it has to be assigned exactly when the class gets defined. The current API for copy_behaviors takes the class object as an argument, so it has to be called after the class gets defined. That's why you can't copy all behaviors and then overwrite some of them: there are causality constraints on when they can be defined and when copy_behaviors can be called.

This is going to be a problem because the "VectorTwoD" and the ("*", "VectorTwoD") keys are not the only ones that get assigned by decorators: keys like (numpy.add, "VectorTwoD", "VectorTwoD") can also be defined by @ak.mixin_class_method, which has to be called at the time when the class is defined. (See ak.mixin_class_method and Mixin decorators.) If the developer writes a class using this instead of assigning a lambda expression, they'd be in just as much trouble as my example above.

The copy_behaviors function needs to not take the class object as an argument, and it should (as a best practice) be called before the class gets defined. In fact, the function doesn't even use the class objects, just their names; it should only ask for what it's going to need.

This should be enough:

def copy_behaviors(from_name: str, to_name: str, behavior: dict):
    output = {}

    for key, value in behavior.items():
        if isinstance(key, str):
            if key == from_name:
                output[to_name] = value
        else:
            if from_name in key:
                new_tuple = tuple(to_name if k == from_name else k for k in key)
                output[new_tuple] = value

    return output

which would then be used like this:

class VectorTwoD(SuperVector):
    def __eq__(self, other):
        return ak.all(self.x == other.x) and ak.all(self.y == other.y)

ak.behavior["VectorTwoD"] = VectorTwoD
ak.behavior["*", "VectorTwoD"] = VectorTwoD
ak.behavior[numpy.add, "VectorTwoD", "VectorTwoD"] = lambda v1, v2: v1.add(v2)

ak.behavior.update(ak._util.copy_behaviors("VectorTwoD", "VectorTwoDAgain", ak.behavior))

class VectorTwoDAgain(VectorTwoD):
    pass

ak.behavior["VectorTwoDAgain"] = VectorTwoDAgain
ak.behavior["*", "VectorTwoDAgain"] = VectorTwoDAgain
ak.behavior[numpy.add, "VectorTwoDAgain", "VectorTwoDAgain"] = lambda v1, v2: v1.add(v2)

or like this:

@ak.mixin_class(ak.behavior)
class VectorTwoD(SuperVector):
    def __eq__(self, other):
        return ak.all(self.x == other.x) and ak.all(self.y == other.y)

ak.behavior[numpy.add, "VectorTwoD", "VectorTwoD"] = lambda v1, v2: v1.add(v2)

ak.behavior.update(ak._util.copy_behaviors("VectorTwoD", "VectorTwoDAgain", ak.behavior))

@ak.mixin_class(ak.behavior)
class VectorTwoDAgain(VectorTwoD):
    pass

ak.behavior[numpy.add, "VectorTwoDAgain", "VectorTwoDAgain"] = lambda v1, v2: v1.add(v2)

The general strategy, for a developer who is using the copy_behaviors function, is to (1) copy everything and (2) replace what's different about this new class. The copy_behaviors API needs to be flexible enough to let them do that. Since some ways of defining behaviors are bound to a class decorator, constraining when they can be assigned, the copy_behaviors has to not depend on the class object, or else there will be a circular dependency.

Saransh-cpp commented 2 months ago

Thank you for the detailed explanation! I went through it and have created a PR with the fix.

nsmith- commented 2 months ago

How should I interpret the fact this is in the _util module? Is it private? Is it ok to use in coffea?

Saransh-cpp commented 2 months ago

copy_behaviors is definitely intended to be a public API. Maybe I should move the function to another module?