How does Mojo plan to make itself "Pythonic" before launch?

ParadaCarleton commented 1 year ago

Review Mojo's priorities

[X] I have read the roadmap and priorities and I believe this request falls within the priorities.

What is your request?

A plan for fixing difficult-to-read code, like the example here. How does Mojo plan to make itself Pythonic before release?

What is your motivation for this change?

Seeing this example function in the docs, which makes me extremely nervous about Mojo becoming yet another D or Julia (easier than C++, but not easy enough to make people switch):

fn matmul_vectorized_1(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            @parameter
            fn dot[nelts : Int](n : Int):
                C.store[nelts](m,n, C.load[nelts](m,n) + A[m,k] * B.load[nelts](k,n))
            vectorize[nelts, dot](C.cols)

Comparing with Julia, another language for ML backend development:

using LoopVectorization
using StarSlice

function *(C::Matrix, A::Matrix, B::Matrix)
    for m in axes(C, 1)
        for k in axes(A, 2)
            @turbo @. C[m, *] += A[m, k] * B[k, *]

Where @. vectorizes, * slices an array along the given dimension, and @turbo automatically tunes parameters like nelts.

Any other details?

Some big problems that make the mojo implementation difficult for me to read:

The use of [x] for both indexing and parameters; I can't tell which is which without making a conscious effort to understand.
Why is it necessary to pass nelts as a parameter to the function? In Mojo, from what I can tell, array sizes are included in array types, so shouldn't this automatically be inferred?
The redundant vectorize function, which looks like it does the same thing as map/vmap but with a name twice as long and incompatible with base Python/Jax code. I understand vectorize is in Numpy, but it's there for historical reasons--mostly since the name map is already taken by base Python--so why not just use map?
What @parameter does is extremely unclear to me.

soraros commented 11 months ago

I share your sentiment that Mojo could feel more pythonic, however, it's unreasonable to assume the kind of kernel program you show above will look anything like Python you see in the wild today, which are higher level.

Clean code for toy example doesn't necessarily scale to more complex problems. Be reminded that the Julia matmal is not written in shown style, it calls BLAS or something equivalent, while the said 2k line Mojo matmal in MAX is written in (most likely) this exact style. What the Mojo entry points grant you is very high programmability, and I it as fitting for this aspiring "language without magic". If you want see more proof why the Julia approach is not as easy/straight forward as you think, check this out. The only version, contract the most people (who didn't benchmark) in the that thread think, that run as fast as the Mojo version is the one written exactly like Mojo (verbose and manual), if not uglier.

Actually, I think one should see it a pleasant surprise that low level Mojo is still comparable to high level Julia.

The use of [x] for both indexing and parameters; I can't tell which is which without making a conscious effort to understand.

I agree it can be a problem, which I haven't see while playing with Mojo. On the bright side, it is consistent with (the latest version of) Python. Related, in type theory, for a generic type List[N: Int] (fixed length vector), we can also say it's a (type) family List indexed by Int.

Why is it necessary to pass nelts as a parameter to the function? In Mojo, from what I can tell, array sizes are included in array types, so shouldn't this automatically be inferred?

I think you misunderstood the meaning of nelts. It's the simd width, not size of the whole array. It's not shown here, it's the kind of parameter that needs to be exposed in order to use autotune in a non-magical way.

The redundant vectorize function, which looks like it does the same thing as map/vmap but with a name twice as long and incompatible with base Python/Jax code. I understand vectorize is in Numpy, but it's there for historical reasons--mostly since the name map is already taken by base Python--so why not just use map

There are many things that can be said about this comment. As much as I like the API of JAX, and want to use it in a Mojo library one day, there is nothing standard in the naming or semantics of vmap. As proof of this, there is also jax.shmap(f) which sends chunks of data with the mapping axis to f unlike vmap. vectorise mean something totally different in Numpy. As for the name of the Mojo function, even you called this action vectorize (in "Where @. vectorizes"). The name coms from the familiar term loop vectorization, and the function does exactly that, I don't see how that's redundant. If you take a look at other functions from the same module (algorithm.functional), you will see they are all low-level (unlike map or vmap or np.vectorize) functions that work on the same level. map will be a name too general to be meaningful.

What @parameter does is extremely unclear to me.

The doc might be helpful.

AriMKatz commented 10 months ago

@soraros I'm familiar with that Julia macro and it definitely does not call blas or any precompiled low level kernel, but instead compiles simd parallel kernels on the fly from pure Julia.

soraros commented 10 months ago

@AriMKatz Are you sure? The doc seems to say otherwise:

Linear algebra functions in Julia are largely implemented by calling functions from LAPACK.

To clarify, I meant in my last post: the Julia stdlib matmal is written Fortran (not macro powered Julia code), and Mojo in Mojo. This is also why I called the code snippet you shown "toy".

AriMKatz commented 10 months ago

@soraros I'm positive. That's not a linear algebra function, it's a custom loop compiled by a third party library

AriMKatz commented 10 months ago

And that's not true, Julia has Fortran kernels but also pure Julia dispatches for many functions l. Check out linearalgebra.jl

soraros commented 10 months ago

@AriMKatz How is matmul (dot) not a linear algebra function? Dense matrix multiply (which is the variant of matmul in question) is defiantly using BLAS:

julia> using LinearAlgebra

julia> @code_lowered dot(rand(10, 10), rand(10, 10))
CodeInfo(
1 ─ %1 = LinearAlgebra.BLAS.dot
│   %2 = (%1)(x, y)
└──      return %2
)

AriMKatz commented 10 months ago

@soraros that's only one code path. I said it has many generic dispatches, and they are here https://github.com/JuliaLang/julia/blob/cf44b530a4499825ba4b7da5d2eccc1611e8c14b/stdlib/LinearAlgebra/src/generic.jl

There's also a generic linear algebra package that rounds out the rest.

The loop with turbo macro is completely Julia codegen

soraros commented 10 months ago

@soraros that's only one code path. I said it has many generic dispatches, and they are here JuliaLang/julia@cf44b53/stdlib/LinearAlgebra/src/generic.jl

There's also a generic linear algebra package that rounds out the rest.

The loop with turbo macro is completely Julia codegen

It's arguably the most crucial code path. It's a fact that Julia DOESN'T generate code for this canonical case which people call matmul: dense float matrix multiply. It's rather irrelevant to our discussion whether "matmul" for some batch matrix type is implemented in Julia, because the heavy lifting is still done by BLAS. If you actually read the code for dot in generic.jl, you shall see it's mostly dispatching and doesn't implement much, only delegate computation to other, maybe more primitive dot like BLAS.dot. Similar things can be seen for functions in stdlib/LinearAlgebra/src/matmul.jl.

Vectorisation, tiling and the like have to happen somewhere; Julia is not magic, we don't see those implementations simply because they are not there. I'm all for a future where Julia matmul is generated by idiomatic Julia powered by some form @turbo, it is currently not the case, simply because the technology is not there yet. Let's don't forget that matmul is not as simple as nested vectorised loop, researchers don't spend decades on just that!

I'm pretty confident because I asked Chris Rackauckas personally in JuliaCon Eindhoven about replacing BLAS etc. with native. He said it's nowhere near prime time but LoopVectorization.jl is actively being worked on. LV.jl's main developer is taking a vacation from his day job at SciML to work on it full time (I'm not sure that's how vacations work). To be fair, most sane projects (Julia included) don't write their own matmul, and there is certainly no shame calling BLAS, Modular is the weird one here.

AriMKatz commented 10 months ago

I think we were talking past each other as I didn't see your edit.

ParadaCarleton commented 9 months ago

. If you want see more proof why the Julia approach is not as easy/straight forward as you think, check this out. The only version, contract the most people (who didn't benchmark) in the that thread think, that run as fast as the Mojo version is the one written exactly like Mojo (verbose and manual), if not uglier.

This is referring to a different program (the Mandelbrot program), not the matrix multiplication. You're right that nobody in the Julia thread managed to find a way to make the Mandelbrot program look better and faster than the Mojo version at the same time, but that seems like a harder problem. My goal isn't making every program super fast with no effort; it's to remove pointless effort. If Mojo can optimize a piece of code just by having me tell it to autotune a parameter at the front, why do I have to tell it to autotune?

I'm also not sure why Julia being ugly in this case would imply Mojo should be too; this is a criticism I've levied at Julia many times too. The lack of automatic optimizations (e.g. fused multiply-add) is part of why I've moved towards using Jax more often in newer projects.

soraros commented 9 months ago

This is referring to a different program (the Mandelbrot program), not the matrix multiplication.

Yes, and matrix multiplication is a considerably more difficult problem.

I'm also not sure why Julia being ugly in this case would imply Mojo should be too; this is a criticism I've levied at Julia many times too.

Because you are comparing things that are only similar on a superficial level. The Mojo code is considerably lower level than the Julia code, and vectorize is not a hint to help the compiler but program that actually implements the vectorization (code taken from slides here):

fn vectorize[simd_width: Int, func: fn[width: Int](Int) -> None](size: Int):
  let simd_end = (size // simd_width) * simd_width
  for i in range(0, simd_end, simd_width):
    func[simd_width](i)
  for i in range(simd_end, size):
    func[1](1)

There is no automatic optimisation happening at all. Since you mentioned JAX, the Mojo code like the C++ implementation of MatMul in XLA, and not your high level code using jax.numpy (Julia code is on this level), or jax.lax who has for control flow primitive like fori_loop or scan on which "automatic optimisations" can happen. I agree one day we want better ergonomics for high level math code, and I absolutely love the vmap API. However, I think you are just asking it from the wrong part (low level library) of the language.

ParadaCarleton commented 9 months ago

There is no automatic optimisation happening at all.

Yep, that would be the issue I'm raising--I'd like it if there was.

Since you mentioned JAX, the Mojo code like the C++ implementation of MatMul in XLA, and not your high level code using jax.numpy (Julia code is on this level), or jax.lax who has for control flow primitive like fori_loop or scan on which "automatic optimisations" can happen.

The point I'm making is several of the optimizations here can (and should) be handled either by the compiler or by a macro (a decorator in Python, but that would be more bug-prone), because they're extremely common (and important) optimizations that I use all the time. Having them missing from Mojo's base library would be a major problem.

modularml / mojo