jorenham / scipy-stubs

Typing Stubs for SciPy
https://pypi.org/project/scipy-stubs/
BSD 3-Clause "New" or "Revised" License

The great `types-scipy-sparse` merger 🎉 #129

Open jorenham opened 1 month ago

jorenham commented 1 month ago

@BarakKatzir has been developing the `types-scipy-sparse` stub package for `scipy.sparse`. A large portion of it appears to be more complete than the `scipy.sparse` stubs in scipy-stubs.

I have talked with @BarakKatzir about joining forces, and we both agreed that gradually merging types-scipy-sparse into scipy-stubs is not only a good idea, but also a feasible one.

@BarakKatzir, how about we use this issue for central planning of how to go about this? E.g. how we can "chop it up" into PRs, and in which order to do so.

related:

BarakKatzir commented 1 month ago

Good idea. So far we've talked about the first PR including:

  1. Moving a single sparse array type (csr / csc, etc.) from the types-scipy-sparse package to scipy-stubs, and adding inheritance from the `_spbase` class (since in my package `_spbase` isn't type-annotated, and you mentioned that despite being private it is used by users).
  2. I probably have to add generics to `_spbase` as well, since there are some methods which need it (e.g., `real`); a rough sketch of what I mean is below. This might also be in the form of generated code.
  3. Regarding the code generation - I'll start by copying my tool (just for the relevant sparse type in the PR). It's a rough tool, but it's a good starting point IMO.
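Roughly what I have in mind for item 2 (the names and the bound here are purely illustrative, not the final annotations):

```python
# Illustrative sketch only -- the actual scipy-stubs definitions will differ.
from typing import Generic, TypeVar

import numpy as np

_SCT = TypeVar("_SCT", bound=np.generic)

class _spbase(Generic[_SCT]):
    # `real` is one of the methods whose return type depends on the scalar
    # type, which is why `_spbase` itself needs a type parameter.
    def real(self) -> "_spbase[_SCT]": ...
```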

After the first PR we'll have a better idea of issues in the merger and we can plan ahead.

jorenham commented 1 month ago
> I probably have to add generics to `_spbase` as well, since there are some methods which need it (e.g., `real`). This might also be in the form of generated code.

Do you think it would help if we defer the `Generic` bits and re-introduce them once we have the fundamentals in place? You used `__ceil__` as an example to show how generics lead to unavoidable code duplication. From that perspective, omitting the generics would be like compressing a file before you transfer it, and decompressing it once all the files have been received.


> Regarding the code generation - I'll start by copying my tool (just for the relevant sparse type in the PR). It's a rough tool, but it's a good starting point IMO.

The `codegen/` directory seems like the right place for that. But I'm not really sure what the best approach would be for integrating this into the CI, and perhaps also pre-commit. The current codegen stuff is mostly meant for one-time use, and is therefore not part of the CI.

But if we decide to defer the generics, do you still think it would be needed? Because, as I mentioned in #47, I plan to integrate unpy into scipy-stubs soon, which also does code generation. So I'm sure you can imagine that having two codegens together could become a bit messy, if not done right.

I think the ideal scenario would be to achieve the same DRY result as your template-based approach through unpy, because unpy code is (and probably always will be) syntactically valid Python. This means that IDEs still understand it, and all of our linters and type-checkers can be used in the same way as well.
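To illustrate the contrast with a made-up example (this is not actual stub content from either package):

```python
import numpy as np

# unpy's input is ordinary modern Python stub syntax (e.g. PEP 695 type
# parameters), which IDEs, linters, and type-checkers can all parse as-is:
class csr_array[_SCT: np.generic]:
    def real(self) -> "csr_array[_SCT]": ...

# A text template that produces the same stub, e.g.
#   class csr_array{{ type_params }}:
#       def real(self) -> csr_array{{ type_args }}: ...
# is not valid Python, so none of that tooling can check the template itself.
```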


> After the first PR we'll have a better idea of issues in the merger and we can plan ahead.

Yea that makes sense. Sometimes I forget that premature optimization is the root of all evil.

BarakKatzir commented 1 month ago

Okay, you're right. Since this is a big change, we should aim to keep it as small as possible, but still in a way that shows how it will scale up. How about I add a generic to the child class, but only use it on a single method that isn't in `_spbase`? That way the changes will be contained to a single `.pyi` file.

I don't think unpy is the correct code generation tool for the job, but maybe I'm misunderstanding it. The code generation here is very specific to scipy-stubs, and is not directly related to downgrading stubs to previous Python versions. Adding overrides that depend on generics and the specific method at hand doesn't sound like something unpy is intended to do, no?

Here's a suggestion; let me know what you think: we can have some meta stub file, `_csr_meta.pyi`, that contains the compressed type annotations. This file will not be distributed with scipy-stubs and will sit elsewhere in the repo, maybe under `codegen/compressed_stubs/`. We can write a tool that decompresses it into `_csr.pyi`, which will be distributed with scipy-stubs. Regarding CI, we can add a check that compares the generated `_csr.pyi` with the one currently present in the commit.
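For concreteness, that check could be as simple as something like this (the paths and the generator entry point are placeholders, not actual files):

```python
# Hypothetical codegen/check_generated.py: regenerate the stub into a scratch
# location and fail CI if it differs from the committed file, so hand-edits
# and stale generated code get caught.
import subprocess
import sys
from pathlib import Path

COMMITTED = Path("scipy-stubs/sparse/_csr.pyi")
GENERATED = Path("build/_csr.pyi")

subprocess.run(
    [sys.executable, "codegen/generate.py", "--out", str(GENERATED)],
    check=True,
)
if COMMITTED.read_text() != GENERATED.read_text():
    sys.exit("_csr.pyi is out of date; regenerate it from _csr_meta.pyi")
```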

In this approach, having `_csr.pyi` in the repo is redundant and can be annoying, since we'd have `_csr.pyi` as a source file that we don't actually want to edit by hand, but it keeps the build simple. Later on, when we're ready, we can apply unpy to the generated `_csr.pyi` stub and downgrade it to older Python versions.

jorenham commented 1 month ago

> Okay, you're right. Since this is a big change, we should aim to keep it as small as possible, but still in a way that shows how it will scale up. How about I add a generic to the child class, but only use it on a single method that isn't in `_spbase`? That way the changes will be contained to a single `.pyi` file.

Yea that makes sense 👌🏻

> I don't think unpy is the correct code generation tool for the job, but maybe I'm misunderstanding it. The code generation here is very specific to scipy-stubs, and is not directly related to downgrading stubs to previous Python versions. Adding overrides that depend on generics and the specific method at hand doesn't sound like something unpy is intended to do, no?

You're right in saying that, at the moment, unpy won't be able to help much with this. But I can imagine that once https://github.com/jorenham/unpy/issues/97 and https://github.com/jorenham/unpy/issues/98 are implemented, we'd be able to achieve similar DRY results to a template-based approach, at least in the case of methods.

> Here's a suggestion; let me know what you think: we can have some meta stub file, `_csr_meta.pyi`, that contains the compressed type annotations. This file will not be distributed with scipy-stubs and will sit elsewhere in the repo, maybe under `codegen/compressed_stubs/`. We can write a tool that decompresses it into `_csr.pyi`, which will be distributed with scipy-stubs. Regarding CI, we can add a check that compares the generated `_csr.pyi` with the one currently present in the commit.

NumPy does something similar with template-based code generation. They use a `{module}.py.in` naming scheme for templates that generate `{module}.py`; see e.g. `numpy/__config__.py.in`. This way IDEs, linters, and type-checkers won't try to parse the template file, which would inevitably cause them to report errors. Keeping the template in a separate directory could avoid most of that, but the IDE still wouldn't be able to parse it (I believe VSCode only considers the filename extension when selecting a file type).

As I also mentioned in https://github.com/jorenham/unpy/issues/61, I think that for any code-generation approach it is important to make it clear to the user that the code they're looking at is fully generated. Otherwise there's the risk of them submitting PRs that (only) change this generated code, instead of its source template.
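For example, the generated file could open with a banner along these lines (just a sketch of the convention, with a hypothetical template path):

```python
# _csr.pyi -- THIS FILE IS GENERATED, DO NOT EDIT BY HAND.
#
# It is rendered from codegen/templates/_csr.pyi.j2; change the template and
# regenerate instead of editing this file directly.
```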

I also like NumPy's approach of having the template in the same place as the source, so that it's easier to find if you're looking for it but have forgotten that it's template-based (which, given my scatterbrain, is bound to happen to me, lol).

> In this approach, having `_csr.pyi` in the repo is redundant and can be annoying, since we'd have `_csr.pyi` as a source file that we don't actually want to edit by hand, but it keeps the build simple. Later on, when we're ready, we can apply unpy to the generated `_csr.pyi` stub and downgrade it to older Python versions.

That seems like a good plan actually; kickstarting it in a DRY manner with a (no offense) "quick-and-dirty" approach, and then gradually introducing unpy once we're ready to do so.

jorenham commented 1 month ago

I have some experience with jinja as a templating language. It has great Python-like syntax that you can use to create macros and such, and it has very decent IDE integration, including support for the jinja + Python combo. So perhaps we could use that? Syntactically, it's actually rather similar to what you've already been using.

See https://github.com/koxudaxi/datamodel-code-generator/ for an example of how it can be used in a Python codegen context.
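To give a rough idea of what that looks like (the method and scalar types below are made up, not taken from either stub package):

```python
# Toy example: render a .pyi fragment with jinja2 (pip install jinja2).
from jinja2 import Template

TEMPLATE = Template(
    "{% for sct in scalar_types %}"
    "@overload\n"
    "def astype(self, dtype: type[np.{{ sct }}], /) -> csr_array[np.{{ sct }}]: ...\n"
    "{% endfor %}"
)

print(TEMPLATE.render(scalar_types=["int64", "float64", "complex128"]))
```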

BarakKatzir commented 1 month ago

Wow, I didn't know about jinja. It looks like adapting my tool into a more comprehensive one won't be such a hassle. The IDE integration is a great feature!

I'll start working on a PR and we can continue refining the framework over there.

jorenham commented 2 weeks ago

How's it going @BarakKatzir ?

BarakKatzir commented 2 weeks ago

Hi, @jorenham. I'm progressing a bit slowly due to deadlines at work (and a family vacation and a sick kid). It took a bit of trial and error to learn how to work with poethepoet and jinja2, but I think I've got the basics of them. Also, I'm trying not to lose some of the improvements that scipy-stubs has over types-scipy-sparse, such as the use of `@override` - but maybe I should just open a PR and we can refine it over there. I'll hopefully open a PR in a couple of days.

jorenham commented 2 weeks ago

> I'll hopefully open a PR in a couple of days.

🎉

> Also, I'm trying not to lose some of the improvements that scipy-stubs has over types-scipy-sparse, such as the use of `@override` - but maybe I should just open a PR and we can refine it over there.

I believe that basedpyright is configured to require `@override` at the moment.

If there are a lot of missing `@override`s, then maybe we could also consider writing a libcst codemod for it. But I suspect it'll probably be faster to do this manually when there are <100 or so cases.
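For reference, a bare-bones sketch of what such a codemod could look like (a real one would also need to add the `override` import and skip methods that don't actually override anything):

```python
# Rough libcst sketch: add @override to every method of every class in a stub.
from pathlib import Path

import libcst as cst

class AddOverride(cst.CSTTransformer):
    def __init__(self) -> None:
        super().__init__()
        self._class_depth = 0  # only decorate functions defined inside a class

    def visit_ClassDef(self, node: cst.ClassDef) -> bool:
        self._class_depth += 1
        return True

    def leave_ClassDef(
        self, original_node: cst.ClassDef, updated_node: cst.ClassDef
    ) -> cst.ClassDef:
        self._class_depth -= 1
        return updated_node

    def leave_FunctionDef(
        self, original_node: cst.FunctionDef, updated_node: cst.FunctionDef
    ) -> cst.FunctionDef:
        if self._class_depth == 0:
            return updated_node  # module-level function, not a method
        if any(
            isinstance(d.decorator, cst.Name) and d.decorator.value == "override"
            for d in updated_node.decorators
        ):
            return updated_node  # already decorated
        decorator = cst.Decorator(decorator=cst.Name("override"))
        return updated_node.with_changes(
            decorators=[*updated_node.decorators, decorator]
        )

stub = cst.parse_module(Path("_csr.pyi").read_text())
print(stub.visit(AddOverride()).code)
```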

jorenham commented 2 weeks ago

> It took a bit of trial and error to learn how to work with poethepoet and jinja2, but I think I've got the basics of them.

Ah yea, there's quite a bit of tooling involved. But in the long run it should all save time. If there's anything I can help with, don't hesitate to ask :)