sizmailov / pybind11-stubgen

Generate stubs for python modules

Option to use TypeVars for numpy.ndarray's shape type argument #187

Closed ringohoffman closed 7 months ago

ringohoffman commented 7 months ago

Type expressions like numpy.ndarray[numpy.float32[4, n]] (see also #155) are not valid according to numpy.ndarray's type stubs, which expect two type arguments: a shape and a dtype. While #113 and #115 suggested using numpy.typing.NDArray, that is just a type alias that carries no shape information, and shape information is something we do have.

I would like to propose a new numpy stub generation flag that formats the pybind11 style:

numpy.ndarray[numpy.float32[m, 1]]

instead as:


M = typing.TypeVar("M")
...  # any others

numpy.ndarray[tuple[M, typing_extensions.Literal[1]], numpy.dtype[numpy.float32]]

This would allow us to create stubs that support shape checking like this:

from __future__ import annotations

from typing import TypeVar

import numpy as np
from typing_extensions import Literal

M = TypeVar("M")
N = TypeVar("N")
P = TypeVar("P")

def matmul(
    l: np.ndarray[tuple[M, N], np.dtype[np.float64]],
    r: np.ndarray[tuple[N, P], np.dtype[np.float64]],
) -> np.ndarray[tuple[M, P], np.dtype[np.float64]]:
    return l @ r

arr1: np.ndarray[tuple[Literal[2], Literal[3]], np.dtype[np.float64]] = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64)
arr2: np.ndarray[tuple[Literal[3], Literal[4]], np.dtype[np.float64]] = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]], dtype=np.float64)

arr3 = matmul(arr1, arr2)  # pyright reveals the type is: ndarray[tuple[Literal[2], Literal[4]], dtype[float64]]
arr3.shape  # (2, 4)
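
To make the proposed rewrite concrete, here is a minimal sketch of how a pybind11-style annotation could be turned into the two-argument form. The helper name `format_ndarray` and its regex are hypothetical and written only for illustration; this is not pybind11-stubgen's actual code, and it assumes every dimension is either an integer or a single identifier.

```python
from __future__ import annotations

import re

# Hypothetical pattern (an assumption, not pybind11-stubgen's parser): matches
# strings shaped like "numpy.ndarray[numpy.float32[m, 1]]".
_PATTERN = re.compile(
    r"^numpy\.ndarray\[(?P<dtype>numpy\.\w+)\[(?P<dims>[^\]]+)\]\]$"
)


def format_ndarray(annotation: str) -> tuple[str, list[str]]:
    """Return the rewritten annotation and the TypeVar names it relies on."""
    match = _PATTERN.match(annotation)
    if match is None:
        # Leave anything the pattern does not recognise untouched.
        return annotation, []

    type_vars: list[str] = []
    dims: list[str] = []
    for dim in (d.strip() for d in match["dims"].split(",")):
        if dim.isdigit():
            # A fixed-size dimension becomes a Literal.
            dims.append(f"typing_extensions.Literal[{dim}]")
        else:
            # A symbolic dimension becomes an upper-cased TypeVar name.
            name = dim.upper()
            if name not in type_vars:
                type_vars.append(name)
            dims.append(name)

    shape = f"tuple[{', '.join(dims)}]"
    return f"numpy.ndarray[{shape}, numpy.dtype[{match['dtype']}]]", type_vars


rewritten, names = format_ndarray("numpy.ndarray[numpy.float32[m, 1]]")
# One declaration line per TypeVar that the stub would need at module level.
declarations = [f'{name} = typing.TypeVar("{name}")' for name in names]
```

Here `rewritten` holds the annotation string and `declarations` holds the `typing.TypeVar` lines that would have to be emitted once per stub module.
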
sizmailov commented 7 months ago

M = TypeVar("M")
N = TypeVar("N")
P = TypeVar("P")

def matmul(
    l: np.ndarray[tuple[M, N], np.dtype[np.float64]],
    r: np.ndarray[tuple[N, P], np.dtype[np.float64]],
) -> np.ndarray[tuple[M, P], np.dtype[np.float64]]:
    return l @ r

Is such ☝️ syntax supported by any tool? Is it any better than the following 👇?

def matmul(
    l: np.ndarray[tuple[Literal["M"], Literal["N"]], np.dtype[np.float64]],
    r: np.ndarray[tuple[Literal["N"], Literal["P"]], np.dtype[np.float64]],
) -> np.ndarray[tuple[Literal["M"], Literal["P"]], np.dtype[np.float64]]:
    return l @ r
ringohoffman commented 7 months ago

Is such ☝️ syntax supported by any tool?

Yes. The revealed types I am sharing come from pyright, the type checker that powers Pylance in VS Code, so the syntax is supported.

This is the work motivating me: https://github.com/microsoft/pyright/discussions/6454#discussion-5853852

You can see that this approach is working in practice, though when it comes to creating TypeVars for generic classes, I think we will end up wanting an int analogue to LiteralString. But we shouldn't need it in type stub generation.

Is it any better than following 👇 ?

Since Literal["M"] is not a type variable, your return type will always be exactly np.ndarray[tuple[Literal["M"], Literal["P"]], np.dtype[np.float64]], whatever the arguments are. It would also mean that nothing other than Literal["M"] is compatible with Literal["M"]:

import numpy as np

from typing import Any, Literal

def matmul(
    l: np.ndarray[tuple[Literal["M"], Literal["N"]], np.dtype[np.float64]],
    r: np.ndarray[tuple[Literal["N"], Literal["P"]], np.dtype[np.float64]],
) -> np.ndarray[tuple[Literal["M"], Literal["P"]], np.dtype[np.float64]]:
    return l @ r

arr1: np.ndarray[tuple[Literal["A"], Literal["B"]], Any] = np.ndarray(shape=(1, 2))
arr2: np.ndarray[tuple[Literal["B"], Literal["C"]], Any] = np.ndarray(shape=(2, 3))

arr3 = matmul(arr1, arr2)  # revealed type is ndarray[tuple[Literal['M'], Literal['P']], dtype[float64]]

You would instead say something like:

import numpy as np

from typing import Literal, LiteralString, TypeVar

M = TypeVar("M", bound=LiteralString)
N = TypeVar("N", bound=LiteralString)
P = TypeVar("P", bound=LiteralString)

def matmul(
    l: np.ndarray[tuple[M, N], np.dtype[np.float64]],
    r: np.ndarray[tuple[N, P], np.dtype[np.float64]],
) -> np.ndarray[tuple[M, P], np.dtype[np.float64]]:
    return l @ r

arr1: np.ndarray[tuple[Literal["M"], Literal["N"]], np.dtype[np.float64]] = np.ndarray(shape=(1, 2))
arr2: np.ndarray[tuple[Literal["N"], Literal["P"]], np.dtype[np.float64]] = np.ndarray(shape=(2, 3))

arr3 = matmul(arr1, arr2)  # revealed type is ndarray[tuple[Literal['M'], Literal['P']], dtype[float64]]

While this works, I don't think it is the right choice. There is no way to infer these literal string values from the normal constructors of array/matrix types the way there is for shapes; the user would always be required to declare them as I did above. In the other examples I have seen of this, the dimensions are modeled as TypeVars bound to int, so I think using str would be unexpected.

What you are describing does remind me of this NewType example I saw in PEP 646, the PEP for TypeVarTuple:

from typing import NewType

Height = NewType('Height', int)
Width = NewType('Width', int)

x: Array[float, Height, Width] = Array()

I think this gets at what you wanted to achieve. You can use these NewTypes in functions like this:

Shape = TypeVarTuple('Shape')
Batch = NewType('Batch', int)
Channels = NewType('Channels', int)

def add_batch_axis(x: Array[*Shape]) -> Array[Batch, *Shape]: ...
def del_batch_axis(x: Array[Batch, *Shape]) -> Array[*Shape]: ...
def add_batch_channels(
  x: Array[*Shape]
) -> Array[Batch, *Shape, Channels]: ...

a: Array[Height, Width]
b = add_batch_axis(a)      # Inferred type is Array[Batch, Height, Width]
c = del_batch_axis(b)      # Array[Height, Width]
d = add_batch_channels(a)  # Array[Batch, Height, Width, Channels]