python / mypy

Optional static typing for Python
https://www.mypy-lang.org/

Mypy "strict mode" for static compilation #1862

Closed datnamer closed 5 years ago

datnamer commented 8 years ago

Problem statement:

"Could there be some way to write a library like numpy so that a single codebase could simultaneously target CPython and the newer compilers, while achieving competitive speed in all cases? If so, what would it take to make that happen? If not, then what’s the next-best alternative?"

Proposed Solution: a "PyIR" that can be consumed by various compilers. Details here: https://docs.google.com/document/d/1jGksgI96LdYQODa9Fca7EttFEGQfNODphVmbCX0DD1k/edit

Question for Mypy: A Cython subset seems to be the recommended source format. Could a "strict mode" mypy be used instead to output this IR? Advantages would include more expressiveness than Cython (generics etc.), bootstrapping off existing mypy work, and less fragmentation.

Excerpts from discussion on gitter:

@njsmith

at pycon this year Jukka and the mypy team were very interested in the idea of somehow using their static type stuff in (something like, this doesn't exist) "strict" mode to help with this

@kmod

I think the issue is that "programmer productivity types" fundamentally != "compiler-useful types"

For example, are subclasses subtypes?

If you pick either yes or no, then the type system becomes non-useful to one community or the other

Bonus: Where would dynd's datashape fit in? Could dynd's datashape be used as a mypy plugin to annotate array types? @insertinterestingnamehere

datnamer commented 8 years ago

This discussion addresses the type vs. subclass question: https://github.com/python/typing/issues/241

Also does the future label mean this is planned for implementation at some point?

ilevkivskyi commented 8 years ago

I am not sure that this is what you want to discuss here, but at least probably something related.

I have seen a discussion in Cython mailing list some time ago, where the conclusion is that PEP 484 is quite useless for Cython because typing.py does not support close-to-the-machine types like unsigned long.

I don't agree with this. PEP 484 is about a commonly agreed syntax for type information in Python, not about the implementation of these types. Currently, Cython uses its own syntax for declaring types, so Cython code is not valid Python code. I want to make a translator script that takes type-annotated Python 3 code and turns it into Cython code. Of course, it should be accompanied by a stub module (ctyping.pyi?) that contains low-level types like int, unsigned_long, double, etc.

With such a tool one could work with plain Python code, and then try for a speed-up by running it through Cython using the annotations already present in the code. These annotations could be checked by mypy, since this is just plain Python.

I think this should be quite simple: it looks like it is not necessary to go through an AST to perform the translation, and it could be done entirely at the level of the lexer/tokenizer. Currently this is just an idea, since I don't have time to work on it now, but at some point I will definitely try it.
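As a rough illustration of the idea above, the sketch below shows what annotated source for such a translator might look like. The names `unsigned_long` and `double` stand in for the proposed ctyping stub module, which does not exist; they are defined here with `typing.NewType` purely as an assumption, so that the file stays ordinary Python that mypy can check. The Cython signature in the comment is likewise only a guess at what a translator could emit.

```python
from typing import NewType

# Hypothetical stand-ins for the proposed "ctyping" stub module.
unsigned_long = NewType("unsigned_long", int)
double = NewType("double", float)


def mean_of_squares(n: unsigned_long) -> double:
    # A translator might emit something like:
    #     cpdef double mean_of_squares(unsigned long n):
    if n == 0:
        return double(0.0)
    total = 0.0
    for i in range(n):
        total += i * i
    return double(total / n)
```

Under CPython the `NewType` wrappers are plain identity functions at runtime, so the function behaves like untyped Python; only a translator or compiler would give the low-level names any meaning.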

datnamer commented 8 years ago

Cython is missing features like generic classes. How would you deal with that?

ilevkivskyi commented 8 years ago

@datnamer

Although I have not used it yet, there seems to be some support for C++ templates in Cython: http://docs.cython.org/en/latest/src/userguide/wrapping_CPlusPlus.html#templates . There are also fused types (https://github.com/cython/cython/wiki/enhancements-fusedtypes) that are very similar to constrained type variables like T = TypeVar('T', ctyping.float, ctyping.double). I think this is one of the most common use cases for numeric speed critical code.
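For readers unfamiliar with constrained type variables, here is a minimal sketch of the PEP 484 side of that comparison. It uses the built-in `int` and `float` rather than the hypothetical `ctyping.float` and `ctyping.double` mentioned above, and the function name `scale` is made up for illustration.

```python
from typing import List, TypeVar

# Constrained type variable: within one call, Num is either int or float.
Num = TypeVar("Num", int, float)


def scale(values: List[Num], factor: Num) -> List[Num]:
    # A type checker verifies the body once per constraint, which is
    # roughly how a fused type yields one specialization per C type.
    return [v * factor for v in values]
```

A checker such as mypy accepts `scale([1, 2, 3], 2)` and `scale([0.5], 2.0)`, while `scale(["a"], "b")` is rejected because `str` is not one of the constraints.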

I am going to ignore all the types that could not be expressed in Cython, at least at early stages. In principle, it is possible to specialize some generics before translation to Cython, but as I see it, it could be quite complex.

JukkaL commented 8 years ago

@datnamer I forgot about this while I was on vacation. I'm still interested in this topic and have some ideas, though I haven't had time to write anything substantial down.

gvanrossum commented 8 years ago

OK, let's have it.

datnamer commented 8 years ago

Cool :). I think pep 526 can make this much nicer also.

datnamer commented 8 years ago

@JukkaL - It's probably a bit (a lot) early for this, but sometimes it's better to be forward thinking: do you think we would be able to statically resolve things like protocols and a potential future multiple dispatch (together or separately), or would this be more restricted, like Cython?

Also, would this need fixed width integers to be added?

gour commented 7 years ago

Hello,

I'd like to stay with Python instead of using C(++) or going with some JVM language (Kotlin, Ceylon,...), and wonder if this issue is supposed to bring a feature whereby, after performing mypy analysis on the code, one could take advantage of it and get one's code cython-ized automatically?

refi64 commented 7 years ago

@gour You might find this interesting.

gour commented 7 years ago

@kirbyfan64 I know about it, and even asked about Nuitka & Mypy on the mailing list (got no reply), but, afaict, Nuitka won't take advantage of type annotations and is going to do its own independent analysis, right?

datnamer commented 7 years ago

@gour that is my understanding, unless something has changed.

JukkaL commented 7 years ago

As far as I know, Nuitka isn't going to use type annotations. Type analysis without annotations is very difficult for larger programs. I haven't tried Nuitka recently or followed closely what's going on there, but I believe that their approach is kind of hard to pull through, except maybe for smaller programs or smaller performance gains than what I'd hope to see.

Compiling programs with PEP 484 annotations to Cython is also not easy to do effectively, since the type systems are quite different. This doesn't mean that PEP 484 annotations can't be used to speed up programs, only that the approach would likely have to be somewhat different from what Cython does right now.

Compiling programs with PEP 484 annotations to CPython C extension modules seems feasible, and I've done some very preliminary work on it. The compiled programs likely wouldn't have full compatibility with Python semantics. For example, if something is annotated as a list, maybe the compiler would insert a runtime check to ensure that you can't assign a non-list object to the variable. Also, if you call a function, maybe the compiler would (under some circumstances) assume that there is no monkey patching and directly bind to the target function instead of going through a namespace lookup. Cython already can do a bunch of similar things to get good performance, so it wouldn't be anything terribly new.
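The two deviations from Python semantics described above can be emulated in plain Python. This is only a sketch of the observable behaviour, not of how any compiler implements it; the names `checked_sum`, `increment` and `make_bound_caller` are invented for the example.

```python
from typing import Callable, List


def increment(x: int) -> int:
    return x + 1


def checked_sum(items: List[int]) -> int:
    # Emulates a compiler-inserted check: the annotation says "list",
    # so anything else is rejected up front instead of being duck-typed.
    if not isinstance(items, list):
        raise TypeError("expected list, got " + type(items).__name__)
    return sum(items)


def make_bound_caller() -> Callable[[int], int]:
    # Emulates direct binding: the target is captured once, so a later
    # rebinding of the module-level name `increment` is not observed.
    target = increment

    def caller(x: int) -> int:
        return target(x)

    return caller


call_increment = make_bound_caller()
```

In ordinary CPython, `sum((1, 2, 3))` would succeed with a tuple; the checked version refuses it, which is the kind of incompatibility the comment above refers to.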

datnamer commented 7 years ago

> I've done some very preliminary work on it.

Is that recent work, or was it the work you blogged about surrounding the initial stages of Mypy?

JukkaL commented 7 years ago

> Is that recent work, or was it the work you blogged about surrounding the initial stages of Mypy?

This is recent and mostly separate from the earlier work, though there are obvious similarities.

datnamer commented 7 years ago

Cool. Would it use the full flexibility of the type system, like generics and protocols etc?

JukkaL commented 7 years ago

> Cool. Would it use the full flexibility of the type system, like generics and protocols etc?

It's too early to say. Likely there would have to be some limitations, but it's unclear what exactly.

datnamer commented 7 years ago

@JukkaL

Have you seen Julia's work on AOT compilation? It can retain pretty much the full dynamicity and expressiveness of the language, including generics, abstractly typed and untyped functions etc., while emitting code within an order of magnitude of C, or matching it.

http://juliacomputing.com/blog/2016/02/09/static-julia.html

The only catches are very sane things, like not being able to monkey patch attributes etc. However, methods can be added using multiple dispatch.

Is this feasible with mypy? It may require things like multiple dispatch for function specialization selection at call time and LLVM stuff.

For something more Pythonic, Numba already has this sort of multiple dispatch. However, it is under the hood and doesn't have the same sort of generic and expressive type system features as Mypy or Julia. Perhaps there could be synergy between the projects. I think @pzwang can say more about whether any ideas are transferable.
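To make "multiple dispatch for function specialization selection at call time" concrete, here is a deliberately tiny sketch. It matches on exact argument types only and ignores subclassing and ambiguity resolution; it is not how Julia or Numba implement dispatch, and the names `dispatch`, `add` and `_registry` are invented for the example.

```python
from typing import Any, Callable, Dict, Tuple

# Maps a tuple of argument types to the specialization registered for it.
_registry: Dict[Tuple[type, ...], Callable[..., Any]] = {}


def dispatch(*types: type) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    def register(func: Callable[..., Any]) -> Callable[..., Any]:
        _registry[types] = func
        return func

    return register


def add(a: Any, b: Any) -> Any:
    # Select the specialization from the runtime types of *all* arguments.
    impl = _registry.get((type(a), type(b)))
    if impl is None:
        raise TypeError("no specialization for these argument types")
    return impl(a, b)


@dispatch(int, int)
def _add_ints(a: int, b: int) -> int:
    return a + b


@dispatch(str, str)
def _add_strs(a: str, b: str) -> str:
    return a + " " + b
```

A static compiler could, in principle, resolve the lookup in `add` ahead of time whenever both argument types are known, which is where annotations would help.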

wtpayne commented 7 years ago

I think that any developments in this area are potentially very useful and will be following them with interest.

For what it is worth, for my use-case, I'm quite able to take advantage of code generation even with an extremely restricted subset of the language...

-Will

JukkaL commented 7 years ago

@datnamer Here are examples of things that might be limited (or the use of which may limit performance gains):

datnamer commented 7 years ago

That all makes sense.

What do you think about generic type safe classes with type vars?

JukkaL commented 7 years ago

Generic type parameters (including for things like List[int]) would be erased at runtime, similar to Java. The compiled form would have to perform runtime type checks. Consider this example:

from typing import List

def f(x: List[int]) -> int:
    a = x[0]
    ...

The compiled form could behave like this:

def f(x: List[int]) -> int:
    _tmp = x[0]
    if not isinstance(_tmp, int):
        raise TypeError
    a = unbox_int(_tmp)
    ...

datnamer commented 7 years ago

Thanks for this example. This would have a runtime cost and the function can't be inlined, right? Or can the branch somehow be eliminated?

JukkaL commented 7 years ago

Which function are you thinking about (regarding inlining)? There would be a runtime cost, but we could use low-level C API calls for the isinstance test and unboxing -- these can be pretty fast. To avoid the runtime type check operations, we'd need special collection types that know about item types, similar to numpy arrays (or Java arrays). It might make sense to have a high-performance custom list-like type that could be used for code that can't afford the runtime type checks. Hypothetical example:

def f(x: FastList[int]) -> int:
    a = x[0]  # No runtime type check needed
    ...

a: Any = FastList[int]()  # let's only do runtime type checking here by using Any
a.append(0)  # ok
f(a)  # ok, runtime check for the argument x passes
a.append('x')  # runtime error 
f(FastList[str]())  # call fails, FastList[str] is not compatible with FastList[int]

However, this would likely be a potential post-1.0 feature instead of a core part of the project.
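The behaviour of the hypothetical FastList above can be emulated in pure Python to show where the checks would move. This sketch is an assumption about semantics only: it is slower than a built-in list, not faster, and a real implementation would presumably live in C. The check in `f` stands in for what a compiler might insert at the call boundary.

```python
from typing import Any, Dict, List


class FastList:
    """Emulation of a list-like container that knows its item type."""

    item_type: type = object
    _cache: Dict[type, type] = {}

    def __class_getitem__(cls, item_type: type) -> type:
        # FastList[int] yields a cached subclass, so FastList[int] is
        # always the same class object and can be compared by identity.
        if item_type not in FastList._cache:
            name = "FastList[" + item_type.__name__ + "]"
            FastList._cache[item_type] = type(
                name, (FastList,), {"item_type": item_type}
            )
        return FastList._cache[item_type]

    def __init__(self) -> None:
        self._items: List[Any] = []

    def append(self, value: Any) -> None:
        # The type check happens once, when an item enters the container.
        if not isinstance(value, self.item_type):
            raise TypeError("item must be " + self.item_type.__name__)
        self._items.append(value)

    def __getitem__(self, index: int) -> Any:
        return self._items[index]


def f(x: FastList[int]) -> int:
    # Stand-in for a compiler-inserted argument check at the call boundary.
    if type(x) is not FastList[int]:
        raise TypeError("expected FastList[int]")
    return x[0]  # no per-item check needed here
```

The design choice this illustrates is checking on insertion rather than on every read, so that code consuming the container can trust item types.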

datnamer commented 7 years ago

Gotcha, thanks. Sorry for all the questions above and below; this is all very helpful as I plan out an application.

This cost/check would only be present for functions called at runtime from the interpreter, not at compile time, right?

How about custom data structures from classes: would we need fixed-width ints for attributes? How would this work with structural subtyping, if at all? So do you mean that a type var T for a field would be erased and instantiated as an int64, for example?

For inlining, I mean any function I want to use in a loop... my dream is writing my own person class for a simulation which has an immutable, stack-allocated, generic yet typesafe random variable as an attribute that is sampled from in a simulation loop. Or maybe that wouldn't be a good use case.

JukkaL commented 7 years ago

Answers (sorry these are terse, don't have much time right now):

kmod commented 7 years ago

Sorry to be a contrarian on this subject, but my two cents: I think there are some key questions that should be answered for an attempt like this. For instance: what is the selling point of having something "Python-like" that is not actually Python? Are people really wanting to use a Python-like syntax without Python semantics? Basically, what is it that draws people to Python (I would have said the ecosystem), and is there evidence that this would end up as something that people would actually want and use?

Separately, there are a number of features that are both very dynamic and also very bread-and-butter for Python. How do you handle exceptions? Do you change the behavior of a bare raise statement? Do descriptors still exist? Unfortunately, making Python amenable to static analysis + compilation is not just a matter of removing "bizarre" features, but also very mundane ones as well.
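For concreteness, here is a short sketch of two of the "mundane" features named above, as they behave in ordinary CPython. The class and function names are invented for the example; the point is only that a plain attribute read or a bare raise carries dynamic behaviour a static compiler has to preserve or restrict.

```python
class Celsius:
    """Descriptor: reading the attribute runs arbitrary code."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return (obj.fahrenheit - 32) * 5 / 9


class Reading:
    # Looks like a data attribute at the use site, but is computed.
    celsius = Celsius()

    def __init__(self, fahrenheit: float) -> None:
        self.fahrenheit = fahrenheit


def parse_or_reraise(text: str) -> int:
    try:
        return int(text)
    except ValueError:
        # Bare raise: re-raises the exception currently being handled,
        # which depends on interpreter state rather than a named value.
        raise
```

A compiler that binds `reading.celsius` to a fixed struct field would silently skip the descriptor call, which is the kind of semantic change the comment is asking about.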

JukkaL commented 7 years ago

This is still mostly an idea. It's still unclear to me whether the approach will be successful (or whether I or somebody else will ever properly implement it). My argument why this might be useful basically boils down to this:

There's a separate question about why this might be preferable to a JIT compiler that can speed up almost arbitrary Python code. Here are some arguments:

For exceptions, I'd try to get as close to Python semantics as possible. Here performance isn't very important for most code. I think Cython already can do this (but haven't verified). I'd like to not change the behavior of bare raise, but to be honest I haven't given it much thought yet. I'd like to support descriptors if the C API makes it possible. My preference would be to support as many features as possible (with the available development resources), even if using some of those features could result in slower performance, assuming that code that doesn't use the feature doesn't have to pay a cost.

One reason for having this issue open is to seek feedback from people who have more experience with these things -- if there turn out to be blockers to this approach, I'd rather hear about them now than waste a lot of effort working on a hopeless project :-) Thanks for the feedback!

wtpayne commented 7 years ago

For me, I'm interested in building a smooth and continuous conveyor belt that takes ideas (algorithms) from inception to production.

I want to be able to quickly model system concepts and prototype algorithms in Python, making full use of libraries like scikit-learn.

Then, after some proof-of-concept engineering gate has been reached, I want to gradually port some small subset of this work (i.e. the successful algorithm concept) to a production environment, maintaining integration with python regression test suites, (MIL) simulations and system models.

The production environment may not permit Python to be used directly -- it may be a real-time embedded platform, or may have security or life-safety requirements that require components to go through a certification process.

Right now, this means re-implementing the relevant parts either in (MISRA) C or Ada, but I am very interested in tools that give me the ability to translate simple numerical algorithm code (Numerical Recipes type stuff) into readable, reviewable and auditable C or Ada.

I.e. I am looking for an open source and Python-oriented alternative to the Mathworks' Simulink Embedded Coder product.

datnamer commented 7 years ago

> To avoid the runtime type check operations, we'd need special collection types that know about item types, similar to numpy arrays (or Java arrays). It might make sense to have a high-performance custom list-like type that could be used for code that can't afford the runtime type checks. Hypothetical example:

@skrah Can dynd serve as the above mentioned container type? I know there was some talk of splitting off the various pieces.

JukkaL commented 7 years ago

@wtpayne The idea here is to use C as a "portable assembler", i.e. the code won't be very readable or maintainable.

datnamer commented 7 years ago

And require the interpreter, correct?

JukkaL commented 7 years ago

Sure, that as well -- we'd require the full CPython runtime.

seanjensengrey commented 7 years ago

Shedskin discovers the types in implicitly typed Python programs and transpiles them into C++.

The research behind it is linked from https://gist.github.com/seanjensengrey/572cffee2574ae2adf24f3831b9d9e24

rowillia commented 7 years ago

@JukkaL What you're proposing sounds very similar to the first iterations of HPHPc at Facebook.

As an aside, have you seen @haypo's FAT Python work? Seems like any efforts along these lines could aid with Victor's efforts as well.

JukkaL commented 7 years ago

@seanjensengrey I've looked at Shed Skin before. The approach I'm proposing can be more flexible and should support more Python features and accessing basically arbitrary Python libraries, since we could support dynamically typed values through Any types. Shed Skin expects everything to have a pretty precise type, which makes it hard to use with legacy code that generally doesn't conform to any particular static typing discipline precisely. There are other major differences as well, such as local vs. whole-program type inference, with relatively well-known tradeoffs which I won't discuss in detail now.

@rowillia My understanding is that HPHPc didn't use type annotations to speed up code, but there clearly are other similarities. Also, I have the impression that HPHPc was basically a full reimplementation of PHP, whereas my proposal would still use the normal CPython runtime and libs.

I've briefly looked at FAT Python before. It looks to me that it is doing most/all work at runtime, making it closer to a JIT compiler than what I'm proposing here.

datnamer commented 7 years ago

But FAT Python looks to make some guarantees about Python code using function guards and the now-merged dictionary versioning. I think these would make Mypy's job easier, no?

den-run-ai commented 7 years ago

There is a group of researchers in Tokyo who work on a two-way transpiler between a subset of Fortran and Python 3.5+ with type hints. They use tools such as this one:

https://github.com/mbdevpl/typed-astunparse

They published a paper at Python HPC:

http://conferences.computer.org/pyhpc/2016/papers/5220a009.pdf

wtpayne commented 7 years ago

Wow -- I didn't know about this one. Thanks for the heads-up!

datnamer commented 7 years ago

@JukkaL https://opensource.googleblog.com/2017/01/grumpy-go-running-python.html

Have you seen that? Looks quite relevant.

JukkaL commented 7 years ago

@datnamer Thanks for the link! It looks interesting, especially for organizations that are also heavily investing in Go and don't have a large legacy codebase that would make porting hard.

Their approach seems to have a few major practical implications:

datnamer commented 7 years ago

Makes sense. But can't there be some kind of Nogil annotation like in cython? Or does cython have some manual memory management that allows such a thing?

ambv commented 7 years ago

With Cython, any operation that involves Python objects and functions must hold the GIL. The "nogil" function annotation and context managers are for code that is purely C. Otherwise Cython will refuse to compile your code. Memory management in "nogil" sections is whatever C/C++ provides at that point.

JukkaL commented 7 years ago

It might be possible to support some things without the GIL, but for it to be safe, I think that you'd only be able to use certain low-level types that don't require taking the GIL such as numpy arrays and fixed-width integers, and you wouldn't be able to call most functions. Not sure how useful this would be. (I haven't tried the Cython nogil feature but I've seen it mentioned in the docs.)

datnamer commented 7 years ago

From the author of the project, when I suggested a 'Dropbox/Google collaboration':

"Yes, leveraging type hints for optimization purposes is a long term goal. Thanks for pointing me to [this] issue, I'll keep an eye on it.

One of the goals of open sourcing was to get feedback and work with outside folks so I'm definitely open to collaboration!"

ethanhs commented 7 years ago

A somewhat related project to static compilation is Hermetic. The program takes type annotated Python functions and through Hindley-Milner type deduction generates C code. Sadly H-M type deduction doesn't work well with Python's OOP style, but the project is very impressive work all the same.

alehander92 commented 7 years ago

Hey, I am the author of Airtight (the HM thing). Airtight isn't really implementing Python; it is more an experiment in combining Python's syntax and philosophy with functional programming and stronger type systems.

Actually I have another library, pseudo-python, that compiles a static subset of Python to readable/idiomatic code in Go/C#/Ruby/JS (C++/Rust in the making), which is more relevant to the discussion. I planned to use mypy type hints when they stabilize (currently it just does a form of full type inference, which is kind of possible because Pseudo is used only for self-contained Python code without dependencies). However, Pseudo also implements only a limited part of Python, so it's not a great example for PyIR.

Good, standardized type annotation syntax/semantics are still a very nice part of Python, because they make it suitable for writing all kinds of specialized transpilers/code generators and for easily targeting languages with rich type systems.

I just saw the link, so I hoped to clear any confusion on Airtight/Pseudo's approach.

ethanhs commented 7 years ago

@alehander42 thank you for clarifying. Pseudo Python looks very interesting indeed! Yes, the PyIR suggestion seems more to be about WebAssembly/LLVM type bytecode to produce faster Python execution. I actually came across https://github.com/sklam/pyir_interpreter, which does seem to implement the idea of a Python IR interpreter.

ilevkivskyi commented 5 years ago

Mypyc has been out there for a while and is going well, so I think this may be closed.