patrick-kidger / torchtyping

Type annotations and dynamic checking for a tensor's shape, dtype, names, etc.
Apache License 2.0
1.41k stars 34 forks source link

allow named sizes for named dimensions #15

Closed teichert closed 3 years ago

teichert commented 3 years ago

(Thanks for this awesome work!)

I've often seen applications in which multiple dimensions of the same tensor will be the same size (which size is only known at runtime). It is useful to be able to separately document that those sizes must be the same while acknowledging the differing purposes of these dimensions.

The following example demonstrates the idea---showing three related features that already work in torchtyping as well as a proposed feature:

def func(feats: TensorType["b": ..., "annotator": 3, "word": "words", "feature"],
         predicates: TensorType["b": ..., "annotator": 3, "predicate": "words", "feature"],
         pred_arg_pairs: TensorType["b": ..., "annotator": 3, "predicate": "words", "argument": "words"]):
    # feats has shape (..., 3, words, features)
    # predicates has shape (..., 3, words, features)
    # pred_arg_pairs has shape (..., 3, words, words)
    # the ... b dimensions are checked to be of the same size.

Things that already work:

Proposed:

Additionally, you would probably want to enforce that, if the specified "size name" matches the name of another dimension (like "word" in the following example), then the sizes of those dimensions should be the same:

def func(feats: TensorType["b": ..., "annotator": 3, "word", "feature"],
         predicates: TensorType["b": ..., "annotator": 3, "predicate": "word", "feature"],
         pred_arg_pairs: TensorType["b": ..., "annotator": 3, "predicate": "word", "argument": "word"]):
    # feats has shape (..., 3, words, features)
    # predicates has shape (..., 3, words, features)
    # pred_arg_pairs has shape (..., 3, words, words)
    # the ... b dimensions are checked to be of the same size.

Thoughts?

patrick-kidger commented 3 years ago

Hmm. Essentially you're saying that you'd like to have something like

TensorType["same_size", "same_size"]

for checking reasons, but something like

TensorType["meaningful_name", "different_meaningful_name"]

for documentation reasons?

So I'm quite concerned that supporting this via a str: str syntax is a little opaque / opens one up to silent mistakes. (e.g. getting them the wrong way around; generally being a confusing hurdle for new users.)
Moreover, every annotation as part of a TensorType is or can be checked in some way; this would introduce a notation that is documentation-only.

At the same time I recognise the value in what you're suggesting. If you have an alternate syntax I'd be open to ideas?

Alternately, it would be very straightforward for you to add this syntax yourself, and just use that locally:

class MyTensorType:
    def __class_getitem__(cls, item):
        if not isinstance(item, tuple):
            item = (item,)
        item = list(item)
        for i, item_i in enumerate(item):
            if isinstance(item, slice) and isinstance(item.start, str) and isinstance(item.stop, str):
                item[i] = item.stop
        item = tuple(item)
        return TensorType[item]

(untested code) This just strips away the first string in a str: str pair and then passes it on to TensorType as normal.

teichert commented 3 years ago

@patrick-kidger Nice summary of the intended use-case, and thanks for the great (and speedy) feedback!

To your first point, I can appreciate wanting to avoid str: str confusion. What would you think of something like: str: sizeof(str)?

I also see your second point that the meaningful name may merely be uncheckable documentation since you could change it without impacting the checking semantics. However, I believe that this is already the case for dimension names in the current setup (for unnamed tensors)---unless a name is reused, it only serves as documentation, right? Furthermore, for named tensors, both the meaningful names and the same_size constraints would be enforceable: t.names[0] == "meaningful_name" and t.names[1] == "different_meaningful_name" and t.shape[0] == t.shape[1].

Additionally, your concern about "documentation-only" might offer additional incentive to have the checks require that the size of a dimension specified as name1: sizeof(name2) should be required to equal the size of all dimensions named "name1" and all dimensions named "name2", that way, the only time that a dimension name does not lead to a checkable size constraint is when that name is only used once (which, again, is the status-quo).

In fact, in light of that point, perhaps name1: sizeof(name2) usage should trigger an additional check that "name2" must be used elsewhere. Since name2 is exclusively for enforcing size but specifically not for documentation of the dimension's purpose, not reusing it can be considered a bug, allowing the new usage to help expose more mistakes: e.g. TensorType["same_size", "same__size"] wouldn't be able to report any errors if the two dimensions didn't have the same size, but TensorType["name1": "same_size", "name2": "same__size"] would be able to complain that "same__size" was only used once).

Finally, I appreciate the example code you gave which should be able to satisfy me on this aspect if necessary and also answers a separate question of mine. Thanks!

patrick-kidger commented 3 years ago

require that the size of a dimension specified as name1: sizeof(name2) should be required to equal the size of all dimensions named "name1" and all dimensions named "name2"

This makes sense to me. Checking both sides of the : makes it harder to make mistakes. Only thorn in that is that checking named dimensions could only happen according to a single name. (The first, for consistency's sake.) But that's not too bad.

I think I'd suggest sticking with just str: str, and just have both names be bound, with checking performed according to both names. That is, without the additional notation of sizeof, and thus without the additional complexity of make-sure-the-name-is-used-elsewhere checking. I think it's important to try and keep things simple, for the sake of the new user experience.

I'd be happy to accept a PR on this, if you want to put something together.

teichert commented 3 years ago

I think I'd suggest sticking with just str: str, and just have both names be bound, with checking performed according to both names. That is, without the additional notation of sizeof, and thus without the additional complexity of make-sure-the-name-is-used-elsewhere checking. I think it's important to try and keep things simple, for the sake of the new user experience.

Thanks! I agree, and it plays nicely with the syntax described in #2 where "x" is interpreted as sizeof("x"). Also, even with the str: str option, make-sure-the-name-is-used-elsewhere checking could still be added eventually if ever desired.

I'd be happy to accept a PR on this, if you want to put something together.

Great; I'll look into it.

teichert commented 3 years ago

@patrick-kidger, am I right that being able to use this safely with mypy is still waiting on python/mypy#10266? (as are other annotations that involve slices TensorType[batch: ..., 10] and TensorType["row": 10, "col": 10])

patrick-kidger commented 3 years ago

I believe so, I'm afraid.