numpy / numpy-stubs

Experimental typing stubs for NumPy

ENH: make ndarray generic over dtype #48

Closed. person142 closed this 4 years ago.

person142 commented 4 years ago

Closes https://github.com/numpy/numpy-stubs/issues/7.

Keeping this as a draft as some discussion will be needed. This makes ndarray parameterized over a type variable with a bound of np.generic; more concretely that's things like np.ndarray[np.int64]. Note that since ndarray is not generic at runtime, you have to use the usual tricks to add annotations in practice.
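
One of the usual tricks, sketched here with PEP 563 lazy annotations (quoting the annotation as a string works the same way):

from __future__ import annotations  # PEP 563: annotations are not evaluated at runtime

import numpy as np

def total(x: np.ndarray[np.int64]) -> int:
    # The annotation is only seen by the type checker, so it does not
    # matter that the runtime ndarray class is not subscriptable.
    return int(x.sum())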

A concrete issue is that it doesn't yet work correctly with subclasses; note that I commented out one of the subclass tests below.

Edit: never mind, I was hitting a stale mypy cache.

On a more philosophical level, there's a question of "is this going to make supporting shapes in the future harder" because presumably that is going to require being generic over shape. One way to handle that would be to make ndarray generic over shape right now, but just do nothing with it. Then all annotations should be ndarray[dtype, Any] in preparation for whatever happens with shapes. Given that it seems pretty unclear what shape support will look like in the future, that might end up being off-base though.
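
In stub terms, that forward-compatible option would look roughly like this (a sketch only; the type variable names are illustrative):

# numpy/__init__.pyi sketch: reserve a second type parameter for shapes.
from typing import Any, Generic, TypeVar

_DType = TypeVar("_DType", bound=generic)  # generic is np.generic in the stubs
_Shape = TypeVar("_Shape")                 # reserved, not used for anything yet

class ndarray(Generic[_DType, _Shape]): ...

# Annotations would then be spelled ndarray[SomeScalarType, Any].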

Thoughts?

rgommers commented 4 years ago

There's a long discussion on shape support in gh-5. Not sure how that relates to ndarray[dtype, Any] though.

person142 commented 4 years ago

> There's a long discussion on shape support in gh-5. Not sure how that relates to ndarray[dtype, Any] though.

Oops sorry, I was a little too in my head when I wrote that description. Let me try to elaborate.

person142 commented 4 years ago

Ok, got some overloads for np.ones and np.zeros that get the dtype inferred in some basic cases. We can continue to do better by adding explicit overloads for cases like dtype = int, but at that point we might want to switch to doing codegen (as has been suggested elsewhere).
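
Simplified, the overloads look something like this (a sketch, not the exact signatures from the PR):

# numpy/__init__.pyi sketch: infer the type parameter from the dtype argument.
from typing import Any, Sequence, Type, TypeVar, Union, overload

_DType = TypeVar("_DType", bound=generic)

@overload
def ones(shape: Union[int, Sequence[int]], dtype: Type[_DType]) -> ndarray[_DType]: ...
@overload
def ones(shape: Union[int, Sequence[int]], dtype: object = ...) -> ndarray[Any]: ...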

Taking this out of draft though; modulo the philosophical issues I think it's in an OK state.

seberg commented 4 years ago

I had missed this PR; Ralf just mentioned it in the community meeting. I would like to make sure I see how this relates to NEP 41/42; in other words, what is "dtype" here?

What does this currently enforce for the DType in ndarray[DType]? E.g. you mentioned using ndarray[np.float32] but that is different from np.ndarray.dtype which returns a dtype instance.

That is, from a typing (theoretical) point of view I do not think it is typically necessary to distinguish ndarray[np.dtype("<float64")] from ndarray[np.dtype(">float64")] (these indicate different storage of the same type). A more direct implication is strings: normally you should not care about a specific string length, i.e. you just have string as the type. Of course, occasionally having a more specific "strings of length up to 5", which is np.dtype("U5"), could be interesting (although even then you do not care about byte order).

Within NEP 41/42, I see these as instances of a class (and also type), making ndarray[DType] where DType (capitalized) is the class. This means that I would like to (possibly sooner rather than later) allow you to write np.dtype[np.float32], etc. to get a DType class representing either little or big endian (or metadata), or strings of any length.
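
Under that proposed (not yet implemented) syntax, usage would look something like:

import numpy as np

# Hypothetical NEP 41/42 spelling: np.dtype[scalar_type] is a DType *class*.
Float64DType = np.dtype[np.float64]

# Either byte order would be an instance of the same DType class:
isinstance(np.dtype("<f8"), Float64DType)  # would be True
isinstance(np.dtype(">f8"), Float64DType)  # would be True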

In any case, I am not sure how much this actually changes things. Typing and classes are somewhat different issues. It is maybe more of a heads-up: I think the scalar type hierarchy is good to use for this dtype, but it may not be quite enough on its own. We also want user DTypes, so users must be able to declare their DType class (and/or scalar type) to be a NumPy DType for typing purposes.

About the shape, just curious: is it really not possible to make np.ndarray[DType] an alias for ndarray[DType, Any] when we add support for the shape? I do not really know the typing module as of now, so I have no idea about its limitations.

person142 commented 4 years ago

Warning: I have only vaguely been following the recent dtypes discussion, so apologies in advance for any misunderstandings.

> What does this currently enforce for the DType in ndarray[DType]?

Right, so it's not actually a dtype; it's enforcing that it be a subclass of np.generic. The problem is that (until NEP 41/42, IIUC) all dtypes are instances of the same class, so from a nominal typing perspective they are all of the same type. A subclass of np.generic is obviously pretty limiting in terms of what we can type, so many things would have to be ndarray[Any].
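
The collapse is easy to see on pre-NEP-41 NumPy:

import numpy as np

# Every dtype is an instance of the single class np.dtype, so a nominal
# type system cannot distinguish a float64 dtype from an int32 one.
type(np.dtype(np.float64)) is type(np.dtype(np.int32))  # True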

> but that is different from np.ndarray.dtype which returns a dtype instance.

Oops, I had that wrong; fixed it in https://github.com/numpy/numpy-stubs/pull/48/commits/ca3fe78bfcbd207bb3c70d2fb5003536daa0ec66.

> Within NEP 41/42, I see these as instances of a class (and also type)

That would be amazing from a typing perspective!

> In any case, I am not sure how much this actually changes things. Typing and classes are somewhat different issues.

If I am understanding everything correctly (I'm sure that I am not), NEP 41/42 would make it possible to type things in a much better way. For better or worse, typing and subclasses are tightly coupled in Python (in that subclass => subtype), so not having classes for dtypes makes it basically impossible (?) to type them.

There are actually situations like that all over the place in NumPy; e.g. ufuncs are all instances of np.ufunc, so from the nominal typing perspective they are all the same type. Similarly, things wrapped by f2py aren't functions; they are instances of a fortran object, so they are all of the same type too.
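
For example:

import numpy as np

# np.add and np.sin have identical nominal types, despite taking
# different numbers of arguments and supporting different loops.
type(np.add) is type(np.sin) is np.ufunc  # True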

seberg commented 4 years ago

Oh, I had hoped that you can somehow type subtypes of classes without subclasses (say things like integers of a certain range; I guess parametric types, if that is the name for it).

UFuncs are an issue with typing, I guess. Internally I was thinking to use something like ufunc.get_implementation(DTypes), i.e. a (multiple) dispatching step finding the correct DType(s). Now from a typing perspective that is pretty nightmarish, I guess, although I am not sure there is any solution to it, since we need to be able to register more loops and dispatch to them?

As to using the scalar hierarchy, I think that is actually fine. It may have some holes in the long run, but I do think we should have a clean scalar_type -> DType mapping using e.g. np.dtype[scalar_type], so at least for the NumPy builtins (which are what is relevant currently) I think it is OK to rely on the scalar hierarchy. (I still have to make up my mind about np.dtype[np.integer], since it is not a concrete type but a superclass/type, but I expect it should just exist.)

So yes, NEP 41/42 should make things better, I guess, since you have a class for each type (ignoring things such as string lengths and byte order), and for that you actually mainly need the first tiny step maybe... Although an alternative where all dtypes are classes (or close) may also be possible (I do not have the vision for that, but we need to finish a thought in that direction in the NEP 41 discussion).

Note that we have not yet accepted the NEP; I expect this to happen, but it is not 100%.

EDIT: One example: I do not think it would be possible to type the unit of a (physical) Unit DType, since that should typically be stored on the dtype instance.

person142 commented 4 years ago

> I was thinking to use something like ufunc.get_implementation(DTypes), i.e. a (multiple) dispatching step finding the correct DType(s). Now from a typing perspective that is pretty nightmarish, I guess, although I am not sure there is any solution to it, since we need to be able to register more loops and dispatch to them?

I suspect that typing ufuncs will eventually require something like a mypy plugin. It allows you to hook into the type checking process and make modifications, so we could hook into the ufunc call method and augment it with the available loops. (All very theoretical though.)
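
A minimal sketch of what such a plugin could look like, using mypy's documented plugin hooks (the refinement logic itself is left as a placeholder):

from typing import Callable, Optional

from mypy.plugin import MethodContext, Plugin
from mypy.types import Type

class UfuncPlugin(Plugin):
    def get_method_hook(
        self, fullname: str
    ) -> Optional[Callable[[MethodContext], Type]]:
        # Intercept calls on ufunc instances; a real plugin would look up
        # the loops registered for the particular ufunc being called.
        if fullname == "numpy.ufunc.__call__":
            return ufunc_call_hook
        return None

def ufunc_call_hook(ctx: MethodContext) -> Type:
    # Placeholder: inspect ctx.type and ctx.arg_types here to pick a more
    # precise return type; for now fall back to the declared one.
    return ctx.default_return_type

def plugin(version: str):
    # mypy entry point: return the plugin class for this mypy version.
    return UfuncPlugin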

> As to using the scalar hierarchy, I think that is actually fine. It may have some holes in the long run

Yeah, I think that the approach we have to take with typing NumPy is "it's impossible, but we can probably still get something useful". As typing and NumPy evolve I imagine that the situation will improve (and hopefully the current typing experiments will help inform that process).

seberg commented 4 years ago

After NEP 41 goes in (assuming it does without major changes), we could put np.dtype[np.int64] in just after 1.19 (or maybe even before, but I doubt it matters). It's new API, but honestly it feels like a very "right" API to me (how the mapping itself is implemented is a bit of a different issue).

person142 commented 4 years ago

@seberg do you happen to have an (in progress I imagine) branch implementing NEP 41? It might be interesting to see what can be done typing-wise against that branch.

seberg commented 4 years ago

@person142 see https://github.com/numpy/numpy/pull/15508, but I have not added anything to make np.dtype[np.int64] work (I can do that though, if just in a separate branch for you to try). And I am still discussing a bit with Eric Wieser right now, e.g. whether the scalars can be used directly here. (Which may be a side point from Eric's point of view, but I see the hierarchy as important, so if it's not part of the DType, then we may have to use the scalar one.)

The main point I have against moving a bit more into the scalar is that I am not sure it is practical to expect scalars to have information used by arrays attached to them. I.e. if you want to make a dtype for an arbitrary Python object, you might have to modify the Python scalar.

In the current NEP 41 design there is a mapping DType class <-> scalar type (which unfortunately also means it's easy to have a reference cycle, which I am not sure matters and probably can be avoided, although with a bit of spit). That mapping is, admittedly, a bit artificial and could be considered useless if we simply expect scalars to carry slightly more weight. This is also a bit funny, since the DType has all information necessary to create a scalar object.

person142 commented 4 years ago

Ok, here's an experimental branch using NEP 41:

https://github.com/person142/numpy-stubs/tree/nep-41

(The tests pass when built against @seberg's NumPy branch.) Some takeaways from that:

seberg commented 4 years ago

Oh, there is no dynamic nature intended. It just seemed like a reasonable way to not overload the namespace with new names and have a uniform mapping scalar type <-> DType. There is some argument for not having a DType class at all and putting it into the scalar. I am a bit torn on it, since that means putting things into the scalar that I am not sure are good there (i.e. that also somewhat means not supporting making a DType that maps to an existing Python type?).

seberg commented 4 years ago

About the metaclass, I do not think it should matter at all? You can just say np.dtype subclass, no?

person142 commented 4 years ago

> About the metaclass, I do not think it should matter at all? You can just say np.dtype subclass, no?

Strictly speaking yes. But it would be nice if we could infer that the type of x is ndarray[Float64] in things like:

x = np.array([1], dtype=np.dtype(np.float64))  # dtype instance (works today)
x = np.array([1], dtype=np.dtype[np.float64])  # DType class, proposed NEP 41/42 syntax

And it seems that to capture things like that we'll need to bake in some kind of understanding of the dtype metaclass.
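
That is, the checker would need to understand something like the following shape of the runtime (purely illustrative, not NumPy's actual implementation):

# Illustrative: subscripting the dtype class itself goes through the metaclass.
class _DTypeMeta(type):
    def __getitem__(cls, scalar_type):
        # Would return the DType subclass for scalar_type, so that
        # np.dtype[np.float64] and np.dtype(np.float64) can both be typed.
        ...

class dtype(metaclass=_DTypeMeta):
    ...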

person142 commented 4 years ago

Something which might be relevant, from the mypy plugin docs:

> get_dynamic_class_hook() can be used to allow dynamic class definitions in mypy

(From https://mypy.readthedocs.io/en/latest/extending_mypy.html#current-list-of-plugin-hooks.)

seberg commented 4 years ago

That get_dynamic_class_hook sounds very promising!? Also, and maybe especially, for ndarray itself?

I would be surprised if mypy doesn't have something for the possible np.dtype[scalar_type] syntax (to be clear: there is nothing fixed about it, I just like it, and if we go the DType class route it seems straightforward to me). The reason why this should be something mypy can do easily enough, at least in the future, is that Python is even adding this as a classmethod on the type (without the need for a metaclass). If we have to use the metaclass/type, I guess that is fine; it does show a bit that the metaclass/type route, even if I consider it more an implementation detail, may have some downsides.
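
(The classmethod being referred to is presumably __class_getitem__ from PEP 560, which lets a class support subscription without a custom metaclass:)

# PEP 560: __class_getitem__ is looked up on the class itself, so no
# metaclass is needed to make dtype[np.float64] work at runtime.
class dtype:
    def __class_getitem__(cls, scalar_type):
        ...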

person142 commented 4 years ago

> That get_dynamic_class_hook sounds very promising!? Also, and maybe especially, for ndarray itself?

Yeah I'm feeling pretty optimistic about what they could let us do. I opened https://github.com/numpy/numpy-stubs/pull/56 to try out a simple plugin for getting more precise ufunc signatures; if we move forward with that I'll see what we could do in the dtype case.

seberg commented 4 years ago

@person142 just a quick note. I expect/hope that the dtypemeta branch will go into master shortly after the 1.19 branching. If it would help you quite a bit, pushing it into 1.19 is not totally unrealistic; there is just not a big reason for me to push for that.

person142 commented 4 years ago

> If it would help you quite a bit, pushing it into 1.19 is not totally unrealistic; there is just not a big reason for me to push for that.

No rush here; I'm fine working from the branch until it's merged. Hopefully I'll have some time this weekend to experiment with writing a plugin using get_dynamic_class_hook.

person142 commented 4 years ago

Closing this; will reopen a reworked version in NumPy focusing on the new dtype classes soon.