If we go down that road, we could consider using stub files in Cython to make that code valid Python (which would allow us to lint).
@WillAyd @jbrockmendel @simonjayhawkins How would you describe the current state of things? I've seen that many core components already have pretty good API coverage. Is it realistic to expect this to happen in the foreseeable future (let's say the next release or two)?
To explain the context of this question ‒ for the last three years I've been working on stub files for Apache Spark. Over this time the contact surface between Pandas and PySpark has grown significantly, mostly due to the introduction and active development of so-called Pandas UDFs. Since Pandas doesn't advertise its annotations, this effectively creates a growing gap, in practice not covered by type checkers. Additionally, the latest upstream developments utilize type hints in Pandas-dependent components, leading to conflicts between static type checking and upstream runtime requirements.
Furthermore, the lack of actionable annotations leads to rather ugly escalations in the case of polymorphic functions, which accept Pandas objects as well as other types.
Now... For some time, to partially address the problem, I've been using Protocols and dummy compatibility imports.
This approach is not without its own problems, but it does the trick. If Pandas is going to adopt PEP 561, these will become obsolete, but if such a move is not going to happen any time soon, I will consider formalizing this approach and agitating for the required adjustments in core PySpark. However, given the amount of red tape, I'd really like to avoid it :)
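For illustration, a minimal sketch of the Protocol half of that approach (the protocol name and attribute set are invented here, not PySpark's actual code):

```python
from typing import Any, Protocol, Union


class DataFrameLike(Protocol):
    """Structural stand-in for pandas.DataFrame while pandas ships no py.typed."""

    @property
    def columns(self) -> Any: ...

    def to_numpy(self) -> Any: ...


def summarize(data: Union[DataFrameLike, str]) -> None:
    # A real pandas.DataFrame satisfies the protocol structurally, so user
    # code type-checks even though pandas itself exposes no annotations.
    ...
```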
@zero323 are you asking if we plan on distributing stub files? If so, I would say no for now; we have been annotating inline. Not fully complete, but you'll see issues like #26766 used to track that progress.
Thank you for the answer @WillAyd.
I didn't mean stubs explicitly, but annotations that are visible to external checkers (which tend to assume no annotations if no stub packages or `py.typed` files are present).

For me it is basically a decision between waiting a bit longer (if these are expected to be ready soon ‒ from that perspective even dynamic annotations are good enough) or adding elaborate workarounds.
As an end user I am also hitting the issue of annotating code with, say, `pd.DataFrame`, and mypy interpreting it as `Any`.
I am a bit confused here. In order to distribute the annotations, isn't it enough to add a `py.typed` file and modify the `setup.py`? That way mypy will be able to use the inline annotated code, right? Or is there something I am missing?
Update: Sorry, I did not see https://github.com/pandas-dev/pandas/pull/28831. However, I am still a bit confused about what is needed to make this happen (if it is happening).
> I am a bit confused here. In order to distribute the annotations, isn't it enough to add a `py.typed` file and modify the `setup.py`?
There is a bit more to it. In general it is good to have annotations that pass type checks with standard-ish settings, otherwise you're likely to break things downstream.
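For reference, the mechanical part of that opt-in is small; the hard part is making the annotations good enough to ship. A sketch with a hypothetical package name:

```python
# Ship an empty "py.typed" marker inside the package so type checkers
# trust its inline annotations (PEP 561).
from setuptools import find_packages, setup

setup(
    name="somepkg",
    packages=find_packages(),
    package_data={"somepkg": ["py.typed"]},  # bundle the empty marker file
    zip_safe=False,  # mypy can't read py.typed out of zipped eggs
)
```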
Related issue: #12412
Need to keep an eye on Microsoft's efforts here: https://github.com/microsoft/python-type-stubs/tree/main/pandas
Regarding our (Microsoft) stubs:
* I believe that if the stub files exist in the package distribution alongside the Python files, then we can use them, and they will take priority over the .py files (@msfterictraut could confirm). So that would be one way we could handle an incremental move to inline types. But it would require package authors to buy into that approach.
@gramster, you're correct that stub files that are packaged alongside the Python source files will take precedence when used by a type checker like pyright or mypy. This is the behavior dictated by PEP 561.
Until recently, library authors had no way to verify the completeness of their type information within a "py.typed" package. We recently added a mode to pyright that analyzes a package and reports missing or incorrect type information. For more details, see this documentation.
also need to keep an eye on https://github.com/cython/cython/pull/3818
@gramster or @erictraut could you give some background on the "content" of the Microsoft stubs? (since we also already have type annotations in pandas as well) Are they much more elaborate than the type annotations we have in pandas? Or did you copy the type annotations for functions in pandas that are already typed, and added stubs for things that were not yet typed in pandas? (to have a bit an idea how divergent they are or how easily/difficult they could be combined somehow)
We added a lot more annotations, and because these are stub files, included annotations for a lot of APIs implemented in Cython.
Thanks to all the folks who joined the dev call on Wednesday. Here's a recap of (my understanding of) the current status, and a tentative proposal for how to proceed. More notes at https://docs.google.com/document/d/1tGbTiYORHiSPgVMXawiweGJlBw5dOkVJLY-licoBmBU/edit?usp=sharing.
https://github.com/microsoft/python-type-stubs/tree/main/pandas contains type stub (`.pyi`) files for pandas. Microsoft would like to upstream them to pandas:
There weren't any objections from the pandas side to taking on maintenance of the stubs. Most of the discussion was around logistics of actually upstreaming them.
pandas is already using inline types wherever possible. I think there's broad agreement that we want to continue that wherever possible (i.e. everywhere but extension modules).
However, the mere presence of a `.pyi` file means that (at least by default) the inline types will be ignored by type checkers for that module. So we don't want to just dump the contents of that repo into the pandas package. That would essentially nullify our inline types, which we've been building up over the past year or two.
My suggestion is to manually go through, file-by-file, and migrate the typestub files to inline types. This will require some effort by maintainers who are experienced with typing. Especially early on, there might be some helpers that need to be migrated first before later files can integrate them. My hope is that once we've ironed out the process, we can engage pandas' pool of volunteer contributors to do the long tail of files.
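To make the mechanics concrete, a hypothetical example of one such migration step (not actual pandas code):

```python
# Stub entry in a hypothetical series.pyi:
#
#     class Series:
#         def head(self, n: int = ...) -> Series: ...
#
# After migration, the annotation lives inline in series.py and the stub
# entry can be deleted, so type checkers fall through to the source:
class Series:
    def head(self, n: int = 5) -> "Series":
        return self
```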
Finally, we'll need to coordinate with the microsoft/python-type-stubs team once we've upstreamed individual files. Perhaps a checklist on a GitHub issue, or a Google spreadsheet. Once a module has been upstreamed to pandas, we'd like to avoid changing it (just) in microsoft/python-type-stubs.
One thing I'm not sure of: what should happen once pandas has completely upstreamed all the files? We'd like the inline types to take precedence for code-completion tools in editors, so they should probably be removed from microsoft/python-type-stubs. But the pyright maintainers mentioned some ancillary benefits of `.pyi` files: they can also be statically parsed for docstrings (which isn't possible for things like `read_csv` that are dynamically generated). And they're often faster to parse since they're just the types / docstrings, not the code. Perhaps we can regularly regenerate the stubs in microsoft/python-type-stubs from the types packaged with pandas?
One small point is that `read_csv` is no longer dynamically created. `read_table` is also a normal function.

I don't know if that changes the position of the pyright maintainers, or if this is a more general problem...
I think for us, we will either come up with an alternate solution to the docstring problem or regenerate full stubs from the pandas source and ship those with docstrings added. We’d obviously prefer that docstrings not be dynamically created but we have workarounds if they are.
Let's make sure that we track the two use cases of typing:

1. Typing within the pandas code base, using `mypy` as is tested in the CI today
2. Typing of the public API for use by downstream code

For (2) we need to figure out a way to test that we have properly typed the public API. The work that @gramster has done provides the typing for much of the public API, and that's what we want to integrate in. I still don't know how we create a set of tests for CI that help us verify that the typing of the public API is correct.
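One possible shape for such a check (a sketch with illustrative expected types, not pandas' actual CI): a file of checker-only assertions that mypy/pyright analyzes but that never runs.

```python
from typing import TYPE_CHECKING

import pandas as pd

if TYPE_CHECKING:
    from typing_extensions import assert_type

    # If a public annotation were wrong, assert_type would fail when the
    # type checker analyzes this file; nothing here executes at test time.
    s = pd.Series([1, 2, 3])
    assert_type(s.head(2), pd.Series)
```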
One thing we could consider doing now is to add a `py.typed` file at the top of the pandas distribution, which would mean that type checkers like `mypy` would look inside the pandas code to check user code. (I verified this with a simple test.)

@gramster if we were to do that, how would `pyright` handle type checking for users, given the presence of the type hints in the pandas distribution (with the `py.typed` file present) plus the type hints that you are shipping with pyright? Would the latter just be ignored because of the presence of the `py.typed` file in the pandas distribution? Or, if a hint wasn't found in the pandas distribution, would the stubs shipped with `pyright` be checked?
> For (2) we need to figure out a way to test that we have properly typed the public API. The work that @gramster has done provides the typing for much of the public API, and that's what we want to integrate in. I still don't know how we create a set of tests for CI that help us verify that the typing of the public API is correct.
I'm not sure I understand the need for this distinction. From the point-of-view of mypy, why would anything in the public API be different? I think that either way we get a set of internally consistent types.
We do still have the problem of declaring an incorrect type (which IIUC are the types of bugs you reported to that repo). For that, I think we just have to rely on
I'm one of the primary authors and maintainers of pyright, and I work with Graham. Sorry I couldn't attend the recent meeting. I'm really excited about the discussion and the progress!
@TomAugspurger, to answer your question about what happens once pandas has completely inlined all of the type information from the stubs. I'm not worried about parsing speeds. Pyright can parse a file very quickly and create an in-memory AST. It performs all type evaluations lazily, using only the parts of the AST that are required to resolve a given type. So I'm not worried about the performance implications of moving to inlined types. (It is much more costly to infer return types when functions are not fully annotated, but if you fully annotate the public interface surface, that's not an issue.)
Dynamically-generated functions are a challenge, but we can work around those if necessary (e.g. conditionally import a stub or statically declare a symbol under a TYPE_CHECKING test).
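A sketch of that workaround with made-up names (not pandas' actual machinery):

```python
from typing import TYPE_CHECKING, Any, Callable


def _make_reader(name: str) -> Callable[[str], Any]:
    # Stands in for a factory that builds functions dynamically at runtime.
    def reader(path: str) -> Any:
        return path

    reader.__name__ = name
    return reader


if TYPE_CHECKING:
    # Seen only by type checkers; gives the dynamic function a static signature.
    def read_csv(path: str) -> Any: ...
else:
    read_csv = _make_reader("read_csv")
```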
@Dr-Irv, if you add a `py.typed` file, Pyright will still prefer stubs if they are present. The source would be ignored in this case. However, consider that `py.typed` generally implies that you have accurate and complete types within the package. You may want to wait until you've finished integrating the stub information before you do that.
I recently added a mode to the CLI version of Pyright that reports the "type completeness" of a package's public interface surface. It identifies all public symbols, evaluates their types, and reports which ones are missing type information (e.g. a parameter or a return type without an annotation). Several library authors have started to use this tool. It could be incorporated into a CI pipeline. For more details, see this documentation. I welcome any feedback and am willing to make changes and improvements based on your suggestions.
> For (2) we need to figure out a way to test that we have properly typed the public API. The work that @gramster has done provides the typing for much of the public API, and that's what we want to integrate in. I still don't know how we create a set of tests for CI that help us verify that the typing of the public API is correct.
>
> I'm not sure I understand the need for this distinction. From the point-of-view of mypy, why would anything in the public API be different? I think that either way we get a set of internally consistent types.
Let's take a method (property) like `DataFrame.axes`. Right now, we have a type declaration for that in the source. But if there is no internal pandas code that calls `DataFrame.axes`, then `mypy` wouldn't pick it up if we had the declaration wrong. Now, if we ran `mypy` on our testing directory, then we might pick it up, but I think we avoid doing that now in the CI (please correct me if I'm wrong).
I'll also point out that `mypy` will not analyze any code that is not annotated. That means it's blind to a lot of errors in unannotated code or code that calls unannotated code. Pyright has a bunch of sophisticated type inference logic and will analyze all code the best it can with the types it can infer, so you may get better results using it over mypy (or in addition to mypy) for your internal verification.
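For example, under default settings (without mypy's `--check-untyped-defs`):

```python
def untyped(x):
    # Real bug, but mypy skips the body because the function is unannotated.
    return x - "oops"


def typed(x: int) -> int:
    # Flagged by mypy: unsupported operand types for - ("int" and "str").
    return x - "oops"
```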
@erictraut wrote:
> I recently added a mode to the CLI version of Pyright that reports the "type completeness" of a package's public interface surface. It identifies all public symbols, evaluates their types, and reports which ones are missing type information (e.g. a parameter or a return type without an annotation). Several library authors have started to use this tool. It could be incorporated into a CI pipeline. For more details, see this documentation. I welcome any feedback and am willing to make changes and improvements based on your suggestions.
Just to see what happened, I took the released version of pandas 1.2, just put a `py.typed` file in the pandas distribution, and tried this out. Lots of things are reported, but there is a bug:
error: An internal error occurred while verifying types: "RangeError: Maximum call stack size exceeded"
followed by a call stack. I tried starting `node` with `--stack-size=16000` and then obtained no output and this message:
Emptying type cache to avoid heap overflow. Heap size used: 1553MB
Having said that, the report that was printed before the internal error reveals that there are a number of classes/methods that pyright inferred as being part of the public API that are not, for example:
error: Type partially unknown for class "pandas.core.algorithms.SelectNFrame"
It tried to do verification on subpackages of `pandas.core` and `pandas.compat`, among others. So we do need a way to distinguish the public API from the internal one. I know this is difficult for pyright to figure out, but suggestions from you on how to modify the pandas code to make this easier would be worth discussing; alternatively, a way to indicate which packages to check with the `--verifytypes` option would be useful.
We've established deterministic rules for distinguishing public from private symbols. These have already been discussed in the typing-sig, and we have general agreement on them, although they haven't yet been included in a ratified PEP. They are all included in the document I linked to above. As I said, these rules should be unambiguous and deterministic. If you have any questions or suggestions, let me know.
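To give a flavor of those rules (the module contents here are hypothetical):

```python
__all__ = ["load"]  # names listed here are explicitly public

def load(path: str) -> bytes:  # public: no leading underscore
    return b""

def _parse(raw: bytes) -> bytes:  # private: leading underscore
    return raw

# In an __init__.py, the redundant-alias form marks a re-export as public:
#     from ._impl import Codec as Codec
```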
I can take a look at the internal error. It's quite possible that it's struggling to deal with really deep call chains in areas of the code that have no return type annotations, which requires it to do a bunch of work to try to infer return types based on the code.
I just tried to repro the problem you're seeing with an internal error. I installed the release version of pandas 1.2 and added a `py.typed` file, like you did. I'm not able to repro the internal error you're seeing. It completed fine for me. Here are the results:
Public symbols: 21152
Symbols with unknown type: 19761
Functions with missing docstring: 14553
Functions with missing default param: 3
Classes with missing docstring: 1208
Type completeness score: 6.6%
Completed in 33.121sec
Any thoughts on what I might be doing different from you?
> Any thoughts on what I might be doing different from you?
Maybe our versions of node, python and pyright are different? I'm also doing this on Windows
Here's my version list:
Also, to run it I had to do a `cd c:\Anaconda3\envs\base38\Lib\site-packages` so it would find pandas.
Your configuration looks similar to mine. The only difference is that I'm running on MacOS, using a virtual environment, and using Python 3.9. I doubt if any of those would make a difference though.
It looks like a big portion of the untyped symbols are in the "pandas.tests" module and its submodules. Is it intended that the pandas tests are distributed with the released package? Do you intend for consumers of pandas to directly import pandas.tests.*?
If I manually delete the `tests` directory, the type completeness report looks much better (and runs much faster):
Public symbols: 4996
Symbols with unknown type: 3841
Functions with missing docstring: 721
Functions with missing default param: 3
Classes with missing docstring: 103
Type completeness score: 23.1%
Completed in 3.919sec
@Dr-Irv I think we've seen issues in Pylance (your pytest one) where we couldn't reproduce something that you could; maybe there's some problem with using conda (where we were just using pip). I thought that `--verifytypes` would work regardless of the directory you're in, as it accepts a module path and is supposed to go through search paths to resolve the import (as it would do in user code).
> It looks like a big portion of the untyped symbols are in the "pandas.tests" module and its submodules. Is it intended that the pandas tests are distributed with the released package? Do you intend for consumers of pandas to directly import pandas.tests.*?
No, and we have an issue about that: https://github.com/pandas-dev/pandas/issues/30741
So, I just did `mv tests _tests` and then it ran fine, but my statistics are different:
Public symbols: 317
Symbols with unknown type: 277
Functions with missing docstring: 26
Functions with missing default param: 0
Classes with missing docstring: 4
Type completeness score: 12.6%
Any idea as to why I would only pick up 317 public symbols and you picked up 4996?
As @jakebailey said, you shouldn't need to switch to the pandas directory to run the analysis. If you find that's required, that points to something wrong with your configuration. You should set up your Python environment (e.g. using a virtual environment) such that if you run "python3", you'll get the correct environment where pandas is installed. Pyright will use the current Python environment when it's invoked.
Here are the command-line switches I'm using: `pyright --verifytypes pandas`.
In VSC, the Python extension chooses the interpreter, so we don't have to pick a binary (versus the CLI), but I think both hint at some weird path oddities. Maybe some more verbose logs could list out the search paths being used. (I'd test conda, but arguably I shouldn't be "working" on a weekend... 🙂)
> As @jakebailey said, you shouldn't need to switch to the pandas directory to run the analysis. If you find that's required, that points to something wrong with your configuration. You should set up your Python environment (e.g. using a virtual environment) such that if you run "python3", you'll get the correct environment where pandas is installed. Pyright will use the current Python environment when it's invoked.
OK, I figured it out. When I first ran it, I ran from the folder that contains the pandas source I use for development. When I ran from a different directory, everything was fine and picked up 4996 public symbols.
@erictraut I had an idea to use info from the way we generate docs to determine the public API that is missing typing info, by matching the pyright report to a list we have for doc generation, but found that `pyright --verifytypes` is reporting errors on things that I don't think it should. Here's one example:
```
error: Type partially unknown for variable "pandas.core.series.Series.axes"
    Type partially unknown for class "property"
        Type partially unknown for function "fget"
            Return type partially unknown
        Type partially unknown for function "__get__"
            Return type partially unknown
```
And here's the declaration for `Series.axes`:

```python
@property
def axes(self) -> List[Index]:
    ...
```
(Note: if you think we should move the issues with `pyright --verifytypes` to an issue in `pyright`, let me know.)

The other issue in doing this matching is that our public API is `Series.axes`, because `Series` is imported from `pandas.core.series` in `pandas/__init__.py`, so `pyright` reports the error as being in `pandas.core.series.Series.axes` rather than `pandas.Series.axes`, and I don't see an easy way to manage that mapping.
This does look like a potential bug in pyright's type verification logic. I'd appreciate it if you'd open an issue in the pyright repo. I'll take a look at it.
All types are reported using their fully-qualified names and not aliases re-exported from higher-level submodules. I could look at emitting all aliases in addition to the fully-qualified name. That could be tricky, but I'm willing to investigate if it would be useful to you.
If you're using the output of `verifytypes` in an automated manner, you might be interested in the fact that pyright also supports JSON output (`pyright --verifytypes pandas --outputjson`). This gives you structured output that is easier to use in automated tooling.
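For instance, a CI script might shell out to pyright and parse the JSON; the key names used below are assumptions to verify against whatever your pyright version actually emits:

```python
import json
import subprocess

proc = subprocess.run(
    ["pyright", "--verifytypes", "pandas", "--outputjson"],
    capture_output=True,
    text=True,
)
report = json.loads(proc.stdout)

# "typeCompleteness" / "completenessScore" are assumed key names; inspect
# report.keys() for the installed pyright version before relying on them.
score = report.get("typeCompleteness", {}).get("completenessScore")
print(f"completeness: {score:.1%}" if score is not None else sorted(report))
```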
@erictraut I've started fixing a few simple things, using `--verifytypes` with pyright to find them. But I'm stuck on how to handle arguments in pandas methods that come from `numpy`. `numpy` currently doesn't have type stubs; in VS Code, there are type stubs provided for it, but they are not in the https://github.com/microsoft/python-type-stubs repo. So how can I get the command-line version of `pyright` to find some version of `numpy` type stubs?
Yeah, numpy types will be a problem. In the long term, we'll want to get the numpy maintainers to do the same thing you're doing for pandas — to include inlined types and packaged stubs in the numpy distribution. According to their roadmap, numpy 1.20 will contain type annotations.
I see a few options:

- Download and install numpy 1.20 from sources rather than using the currently-published 1.19.
- Declare placeholder types in a dummy stub file like `_dummy_numpy_types.pyi` and conditionally import this within an `if TYPE_CHECKING:` statement so they are used only by type checkers (see the sketch below).

One way for you to make fast progress in some areas is to integrate the stub files for the `_libs` directory. Simply copy the `.pyi` files from the python-type-stubs repo.
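A sketch of the dummy-stub option, with illustrative names:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Resolved only by type checkers; _dummy_numpy_types.pyi would declare
    # minimal placeholders such as "class ndarray: ...".
    from _dummy_numpy_types import ndarray


def mean(values: "ndarray") -> float:
    ...
```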
cc @BvB93 since you've been working on NumPy type annotations (just for your awareness of https://github.com/pandas-dev/pandas/issues/28142#issuecomment-766128669; I think nothing to do right now).
@Dr-Irv What version of NumPy do you have in your environment? Can you try with NumPy master? Also some docs at https://numpy.org/devdocs/reference/typing.html that might be helpful.
> @Dr-Irv What version of NumPy do you have in your environment? Can you try with NumPy master? Also some docs at https://numpy.org/devdocs/reference/typing.html that might be helpful.
I had numpy 1.19.5. Have now created an environment using numpy master and am checking pandas types using that.
> - Download and install numpy 1.20 from sources rather than using the currently-published 1.19.
@erictraut Tried this, and now `pyright --verifytypes` on pandas is taking a long time (over 5 minutes). Will open an issue in the pyright repo for that.
> My suggestion is to manually go through, file-by-file, and migrate the typestub files to inline types. This will require some effort by maintainers who are experienced with typing. Especially early on, there might be some helpers that need to be migrated first before later files can integrate them. My hope is that once we've ironed out the process, we can engage pandas' pool of volunteer contributors to do the long tail of files.
I'm not sure the file-by-file method is the way to go. The `pyright` tool can give us some guidance.
I've been using `pyright --verifytypes` to analyze where we are with pandas. It helps to pinpoint what things to work on first, because it outputs a message of the form "error: Type partially unknown for..." for each missing thing, and if the same class/function is used from other pandas classes/functions, it will appear multiple times. So I ordered them and started hacking away.
In master, it reported 18,161 "Type partially unknown" messages. The simple PR I submitted at #39348 reduced that number to 16,654.
I have made changes locally to fix `ExtensionArray` and `ExtensionDtype`, and that lowers it to 15,199. I'll submit a PR for that once #39348 is merged in.

So this is at least a way to find the places that are missing things. The next one I plan to look at is `Index`.
Note that the differences between the stubs in our repo and the ones in pandas aren't just completeness (or lack of unknowns, which we still have); it's also extra coverage and a few more overloads, so checking verifytypes won't quite be the same as actually checking for differences, I don't think.
Have we considered using PEP 561 to do type hinting for pandas dataframes? For instance, a dataframe could be represented as `Annotated[pd.DataFrame, Columns[('col1', float), ('col2', int), ...], Shape[N, 3]]` or something like that?
I assume you mean PEP 593 (which is where `Annotated` comes from).
For the most part, `Annotated` doesn't really have a meaning to type checkers unless you start making heavy use of plugins to escape the type system and do fancy stuff (which means locking into a specific type checker for the more interesting features).
You may be interested in PEP 646, which is likely to be ratified soon and is available in pyright and (I think) pyre.
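For a taste of what PEP 646 enables, here is a sketch adapted from the PEP's own examples (not a real pandas/numpy API); it uses `typing_extensions` so it runs on older Pythons:

```python
from typing import Generic, NewType

from typing_extensions import TypeVarTuple, Unpack

Shape = TypeVarTuple("Shape")

class Array(Generic[Unpack[Shape]]):  # variadic generic: one type per axis
    ...

Height = NewType("Height", int)
Width = NewType("Width", int)

def transpose(x: Array[Height, Width]) -> Array[Width, Height]:
    ...
```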
Dear pandas team, @jreback, @TomAugspurger, is there a timeline for integrating the typing stubs into pandas? Thanks
There is an ongoing effort to improve annotations, but no specific timelines.
Reckon it's OK to add a `py.typed` file for 1.4, even though annotations aren't complete yet?
If you add py.typed, that means the pandas lib should be treated as type complete and as the canonical source of info, and type checkers will prefer it over any other installed stubs (including stubs shipped with Pylance and various other stub projects). If the typing isn't complete yet, that may make for a worse experience without much of a workaround (the preference is defined in PEP 561).
I know there are a lot of types for pandas still in Microsoft's python-type-stubs repo; I'm not sure where the effort to merge those in here for parity currently stands.
@jakebailey
> If you add py.typed, that means the pandas lib should be treated as type complete and as the canonical source of info, and type checkers will prefer it over any other installed stubs
This doesn't seem right. Since you mentioned PEP 561, it states that first in the resolution order are:
- Stubs or Python source manually put in the beginning of the path. Type checkers SHOULD provide this to allow the user complete control of which stubs to use, and to patch broken stubs/inline types from packages. In mypy the $MYPYPATH environment variable can be used for this.
and inline hints have the second-lowest precedence (with only typeshed being lower):
- Inline packages - if there is nothing overriding the installed package, and it opts into type checking, inline types SHOULD be used.
So with inline packages you can still override any type hint; you just cannot depend on the typeshed.
This can be easily tested (the corresponding code can be found here).
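A minimal version of such a test might look like this (hypothetical stub path, not the linked code):

```python
# check_override.py -- run as:  MYPYPATH=./stubs mypy check_override.py
# If ./stubs/pandas/__init__.pyi exists, PEP 561 resolution makes mypy use
# it instead of the inline annotations shipped in the pandas package.
from typing_extensions import reveal_type  # runtime-safe reveal_type

import pandas as pd

reveal_type(pd.DataFrame)  # shows which source (local stub vs. inline) won
```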
As for:

> be treated as type complete

`partial` should apply to standard packages as well:

> Regular packages within namespace packages in stub-package distributions are considered complete unless a `py.typed` with `partial\n` is included.
> Reckon it's OK to add a `py.typed` file for 1.4, even though annotations aren't complete yet?
I'd like to discuss the idea I sent out via email about maintaining the stubs as a `pandas-stubs` project that can be maintained independently of whatever we put in the `pandas` project. We could start by pulling over the Microsoft stubs.
Let's discuss tomorrow ‒ there'd need to be a way of making sure it doesn't go out of date. Also, there's already https://github.com/VirtusLab/pandas-stubs
xref https://github.com/pandas-dev/pandas/pull/28135#issuecomment-524659775
Do we want to make pandas PEP 561 compatible?
https://mypy.readthedocs.io/en/latest/installed_packages.html#making-pep-561-compatible-packages