Ideas for improving typeshed testing and tooling

AlexWaygood commented 1 year ago

A few days ago, I had a great conversation with @kkirsche offline about some ways we can improve our communications with contributors. One of the things that came up is that it would be good to have a clear outline of specific areas to do with tooling and testing where PRs would be welcomed. With that in mind, here's some areas where I think our testing and tooling could be improved, and where I would welcome contributions.

For any of these tools/ideas: if a PR would be enough work that it'd feel wasteful if it was rejected, please make sure to open an issue discussing the specific change first. If an issue gets a few thumbs-ups, it's a pretty good sign we like the idea, and you should feel free to proceed :)

(Other maintainers: please feel free to edit this!)

Existing tools

Improving existing tools is the "low-hanging fruit": it's always going to be easier to build on existing infrastructure than building something new entirely from scratch.

For each of these tools, I've tried to list a "primary maintainer": this is the person who's most likely to make decisions on a certain tool. That doesn't mean that this person is the only person who's ever going to chime in on a certain tool. For example, @hauntsaninja is the person with the most expertise on stubtest, and people are probably always going to defer to him with regards to that script. But that doesn't mean that other people won't also be interested in reviewing -- @sobolevn and I also have experience with writing stubtest improvements, and may well chime in on a stubtest PR.

Stubtest

Primary maintainer: @hauntsaninja

Stubtest compares what exists at runtime to what we've got in the stub, and checks for inconsistencies between the two. It's one of the most useful tools in our CI, and also really easy to contribute to. If you make a PR for a stubtest improvement, it'll probably be reviewed by some combination of @hauntsaninja, @JelleZijlstra, @sobolevn and/or me within a few days :)

Stubtest improvements generally fall into two categories:

New checks for catching where the stubs are inconsistent with the runtime.
PRs to reduce false-positive errors. Reducing the number of false-positives is very useful as it makes it easier for us to set --ignore-missing-stubs to False for more third-party packages.

What to improve:

There's a bunch of TODO comments in the stubtest codebase. Any of those would be good candidates for PRs.
With regards to adding new checks: think creatively! Stubtest doesn't have a generalised way of comparing the runtime to the stub; it's just a series of "special cases" where we've "hardcoded" in checks for inconsistencies between the stub and the runtime. If you notice something stubtest is missing in CI, that's probably a good candidate for a stubtest PR. If you're unsure about whether the idea is good or not, file a mypy issue first -- if it gets a few thumbs-up, it's probably a pretty good sign that it's a good idea.

Stubgen

Primary maintainer: ?

Stubgen is a tool for autogenerating stub files from source code. It's not a code base I'm particularly familiar with, and I'm not sure it's quite so actively maintained as some of the other tools on this list -- but here's a few ways I think it could be improved:

Make generated stubs comply with flake-pyi "out of the box". Our "typeshed style" has evolved in recent months and years, but stubgen has been a bit slow to catch up. It would be great if the stubs stubgen produced were automatically compatible with flake-pyi, so that we didn't have to do so many manual fixes.
Fewer re-exports (maybe we can already do this with the --export-less option? Maybe we just need to update scripts/create_baseline_stubs.py, typeshed's wrapper script for running stubgen?) By default, stubgen almost always assumes that an import from somewhere else in the package should be treated as an explicit re-export. Unless the file is an __init__.py file, however, I think stubgen should actually generally assume the opposite. This comes up a lot in typeshed code reviews; fixing this would save a lot of time. (Fixed by #8843.)
Add "stubgen-but-with-inference" (@hauntsaninja's idea in https://github.com/python/typeshed/issues/8481#issuecomment-1221383451). Mypy has a wealth of information that it could use to make sophisticated inference about good signatures for generated stubs. But even adding super-dumb rules like "__bool__ methods are generally going to return bool" are going to be correct 99% of the time, and will save time for typeshed reviewers.

flake8-pyi

Primary maintainer: @AlexWaygood

flake8-pyi is a flake8 plugin to check stub files that works through analysing the Abstract Syntax Tree of the stub. It's a tool that's a lot more "dumb" than stubtest, since it doesn't have the full power of a type checker behind it, and doesn't know anything about the runtime either. There's still a lot you can do just with simple AST analysis, however, and it's another code base that's easy to contribute to if you're familiar with the basics of ASTs and the visitor pattern.

Lots of recent checks we've introduced have been fairly style-focused, but we've also added some bugbear-esque checks in the past that are aimed at catching common bugs:

We should continue to think creatively about whether there are other common bugs that we can catch using a simple AST analysis.

flake8-pyi is maintained by the typeshed team, so if you have an idea for a new check, just file an issue! If it gets a few thumbs-up, it's probably a good idea, and you should feel free to make a PR. If we're sceptical, we'll probably say so :)

There's also a bunch of open issues already with ideas for new checks. If it has a few thumbs-up (or even if it's just older than a week or two and hasn't been closed yet), then it's probably an idea that we're receptive to!

mypy_primer

Primary maintainer: @hauntsaninja

Mypy_primer checks out a copy of typeshed prior to a PR and after a PR. It uses "old typeshed" and "new typeshed" to run mypy against millions of lines of open-source code, and reports the difference between the two.

Here's some ways I think we could make mypy_primer even better:

Add more projects to mypy_primer (https://github.com/hauntsaninja/mypy_primer/issues/42). This one's a no-brainer.
Work on the ideas listed at https://github.com/hauntsaninja/mypy_primer#contributing. In particular, having a way to bisect between typeshed commits would be pretty nice.
Figure out ways to analyse "mypy_primer coverage":
- For a given stubs package in typeshed, how many packages included in mypy_primer depend on that stubs package?
- If we figure out a way of answering the above question: how best to communicate that information? Should we have a website somewhere that auto-updates information about mypy_primer coverage? A markdown file in typeshed that a bot auto-files PRs to update, listing mypy_primer coverage information for each stubs package? Etc. etc.

Test cases

Primary maintainer: @AlexWaygood

Our test_cases directory (aimed at providing regression tests for places where stubs have caused problems in the past) is new, and still somewhat experimental. The biggest outstanding issue I see was pointed out by @Akuli in https://github.com/python/typeshed/pull/8066#discussion_r896969019, which is that the test cases are liable to slip out of date with the upstream source code in much the same way that stubs are. How can we ensure that that doesn't happen? Can we add some kind of verification that ensures that the places where we're asserting type checkers shouldn't emit errors do actually pass at runtime?

Stubsabot

Primary maintainer: @hauntsaninja

Stubsabot -- a bot which automatically suggests bumping pinned versions for third-party stubs -- is another new addition to our tooling. Here's some ideas of how we could make it even better:

Fixing #8434
~~Fixing #8754~~
~~Fixing #8778~~
More information in PR messages about what changed between releases. The more information stubsabot provides for us automatically, the easier it is to review stubsabot PRs (and to make our own PRs in cases where the stubsabot update isn't sufficient). Can we get information about whether any new files were added in a release, for example? Could we automatically run create_baseline_stubs.py on any new files that have been added between releases?
After stubs packages have been marked as obsolete, we wait six months before deleting them. We could plausibly have stubsabot automate the deletion PRs, rather than having to manually remember to do the deletion after six months.

The "stubtest failed" bot

Primary maintainer: @hauntsaninja

Note: future status of this tool is TBD. https://github.com/typeshed-internal/stub_uploader/pull/60 means that we can now pin stubs to more specific versions, which should hopefully reduce the number of daily stubtest failures we get. Maybe we'll be able to do away with the check altogether...

Stubtest is run on the full suite of third-party stubs every night, and a bot automatically opens an issue if stubtest failed on any of them (e.g. #8745). Things that would be great to see improved about this workflow are:

Could the bot automatically tell us which stubs stubtest failed on?
Could the bot automatically link to the precise stubtest run that failed, instead of us having to click through and navigate to the stubtest run that's red?
Could the bot automatically paste the results of the failing stubtest run into the issue description?
Can we stop the bot from creating a new issue if there's already a "stubtest failed" issue open?

Miscellany

These are both pretty minor nits, but it would be great to:

Add a test to make sure that the exclude-from-stricter-checks entries in pyrightconfig.stricter.json stay alphabetised, so I don't have to keep annoying contributors that break the alphabetisation ;) tests/check_consistent.py would be a good place for such a check. (Note that most json-parsing tools will refuse to parse pyrightconfig.stricter.json because it's strictly-speaking invalid json.)
Add a test that ensures that nobody writes if sys.version_info >= (3, 7) guards anywhere. (3.7 is the minimum supported typeshed version, so such checks are redundant.) tests/check_new_syntax.py might be a good place for such a check.

Ideas for new tools

These ideas are necessarily going to be more work than the ideas above, as most of them involve writing new scripts from scratch, which would create its own maintenance burden. But here's some pie-in-the-sky ideas for brand new tests/tools we could add.

Note that new tools don't need to be part of typeshed or to run in CI to make a difference. For instance, for a long time, we didn't run stubtest in CI, but during that time we used stubtest to fix many, many errors in typeshed. Keeping a tool separate is a good way to avoid disagreement over things like the balance of false positives vs false negatives, to iterate quickly on the tool, and to prove the tool's value.

More automated fixes:
- Are there any other fast pre-commit hooks already out there that would be useful to add to our CI?
- I love flake8-pyi, but I feel like it's probably more annoying for contributors than something they actively welcome. It's not something I'm planning on working on, because I don't like the idea of the amount of work it would take to get to a tool that's properly tested -- but a bunch of the checks in flake8-pyi feel like they could probably be automated fixes in a pre-commit hook.
- Something I see very often is contributors trying to import TypedDict from typing (for example) -- which doesn't work, since it needs to be imported from typing_extensions in order to be compatible with Python 3.7. Is there an automated way of fixing that? (Can isort already do that, if we have the right settings?)
Use the CPython test suite. We could set up a mypy_primer-esque bot that runs mypy (using typeshed's stubs) against the CPython test suite before and after a PR, and reports on any differences in mypy's results. I have no idea what the signal-to-noise ratio here would be, and I think it might be a fair amount of work to get this to a script we could use in CI -- but maybe worth exploring?
Pyright has a sophisticated type inference engine that can be used to infer types in unannotated code. Can we use that somehow to compare pyright's inferences against our manual stubs? Could we do something similar with mypy using the --check-untyped-defs option?
Maybe some things in the space of making it easier to upstream stubs or validate type hints against the original code? Rough thoughts at https://github.com/python/typeshed/issues/8481#issuecomment-1221383451
Any other ideas?

sobolevn commented 1 year ago

Mypy improvements are also welcome! Anyone can ping me to get help: suggestions / reviews / etc.

srittau commented 1 year ago

Unpinned temporarily to make space for #8956.

jenstroeger commented 1 year ago

Thank you @AlexWaygood for the list of tools, I’ll have to take a closer look at some of them. I was actually about to open an new issue, but after reading through your writeup I decided to comment on this one instead.

Idea for a new tool

My suggestion/question is related to the Allow package references as version specifiers over at Python’s discussion board.

I noticed that a types package here doesn’t necessarily have any version-enforced relationship with its code package. For example, it is perfectly fine to have this installed

protobuf                      3.20.3
types-protobuf                3.19.6

This looks problematic and a checker that warns against such version mismatch would be useful.

A proper solution, methinks, would be to create a dependency between the types-foo package and its related foo package, which is discussed here but might be better suited as an issue on this Typeshed repo?

hauntsaninja commented 1 year ago

Thanks for the suggestion! Agreed that this has been a bit murky (although the murkiness has often worked out favourably for users). To that end, we've recently done a couple things to make versioning between types-* and upstream work better: 1) the new versioning scheme in stub_uploader that should result in it being a little clearer what upstream a given version corresponds to, 2) stubsabot, which helps ensure we keep our stubs up to date.

I would be okay adding upstreams as an extra to types-* packages, but would be strongly opposed to adding a default dependency, especially one that had constraints.

jenstroeger commented 1 year ago

Thanks @hauntsaninja, glad to know that this issue is being at least thought about. Perhaps I should link this comment to the Python discussion or would you like to leave a note there? Just to make sure we cross-pollinate on this issue.

I would be okay adding upstreams as an extra to types-* packages,

Using what name? Then users would have to specify, e.g.

dependencies = [
    "protobuf ==3.20.3",
    "types-protobuf[protobuf] ==3.20.3",
]

or some such?

but would be strongly opposed to adding a default dependency, especially one that had constraints.

Mind to elaborate on the reason?

AlexWaygood commented 1 year ago

Mind to elaborate on the reason?

Wherever possible, stubs packages should be extremely minimal packages that are easy to install, have minimal dependencies, contain no executable code, contain minimal security risks, and are easy to bootstrap.

We've started to relax this policy recently with recent moves working towards allowing non-types dependencies for stubs in some situations (this work is not yet complete). But it's still a principle we should stick to wherever possible.

Avasam commented 1 year ago

Here's an idea, because it happened to me again: If stubtest doesn't know the name of a parameter (kwarg, c-module,...), and it starts with a single _ and it would not shadow a keyword/import if stripped. Then it is likely a mistake and the dev meant to use a positional-only argument.

Or I can wait for mypy to support the / syntax for positional-only arguments. Which I assume may happen with Python 3.7 EOL next year.

AlexWaygood commented 1 year ago

Or I can wait for mypy to support the / syntax for positional-only arguments. Which I assume may happen with Python 3.7 EOL next year.

Yeah, I'm tempted to say it may not be worth it to write a check for this, given that 3.7 EOL is on the horizon :) But if the implementation isn't too difficult, it might be interesting -- writing the check might even reveal a few existing bugs in typeshed!

(Note: I'm not sure exactly when we'll be dropping support for Python 3.7. We haven't really discussed it yet. We waited a few months after Python 3.6 had reached EOL status before dropping support for 3.6, due to the fact that it was still quite widely used when it reached EOL status, and we wanted to give people more time to move over.)

Akuli commented 1 year ago

There aren't that many parameter annotations that start with a single underscore. We could check the existing ones manually.

$ git grep '\(, \|(\)_[^_][A-Za-z0-9_]*:' | wc -l
212

danieleades commented 1 year ago

any appetite for looking at using Ruff for linting? I'm not actually sure what ruff's support for .pyi files is like

sobolevn commented 1 year ago

I like ruff (you can find my name among contributors), but I think that flake8 serves us quite well:

Speed is good enough
Plugins that we use are flake8 specific
Stability

As far as I know, there are no unique checks in ruff that can help us in better typing annotations.

AlexWaygood commented 1 year ago

Ruff seems cool in many ways, but we definitely can't switch to ruff at typeshed.

Flake8-pyi is by far the most important flake8 plugin for typeshed, and it's an important part of our CI. Ruff has only implemented a few of the rules from flake8-pyi, and the rules it has implemented from flake8-pyi have several not-so-great behaviour differences from the way flake8-pyi has implemented them.

Also, flake8-pyi is currently maintained by the same team that maintains typeshed, which makes it very easy for us to make tweaks to flake8-pyi as and when typeshed requires it. Ruff seems very open to PRs, which is great, but there's also the fact that I don't know rust, so I wouldn't be able to make PRs fixing bugs in their logic for the flake8-pyi rules they've implemented.

srittau commented 1 year ago

Since the discussion has mostly died down, I've unpinned this.

python / typeshed