modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

Slower Matrix multiplication than numpy #2660

Closed taalhaataahir0102 closed 2 months ago

taalhaataahir0102 commented 4 months ago

Bug description

I've tried running the Mojo matmul example from the repository's examples directory (https://github.com/modularml/mojo/blob/main/examples/matmul.mojo). The output shows that even the most optimized matrix multiplication in Mojo is still 3 to 4 times slower than NumPy. Following are the results:

CPU Results

Python:         0.003 GFLOPS
Numpy:        363.124 GFLOPS
Naive:          6.225 GFLOPS   2138.28x Python
Vectorized:    22.350 GFLOPS   7677.58x Python
Parallelized: 102.933 GFLOPS  35358.39x Python
Tiled:        104.982 GFLOPS  36062.43x Python
Unrolled:     107.915 GFLOPS  37069.74x Python

Could someone please explain the performance difference I'm seeing? The most common operation in machine learning is matrix multiplication, and it's slower in Mojo than in NumPy. NumPy, a Python library wrapping optimized C code, is what most people actually use for these tasks (rather than a matmul written in pure Python), so what's the motivation for using Mojo if it doesn't perform as well as conventional Python code using NumPy?

Steps to reproduce

git clone https://github.com/modularml/mojo.git
cd mojo/examples
mojo build matmul.mojo
./matmul


System information

OS: Ubuntu 22.04.3 LTS
Mojo version: mojo 24.3.0 (9882e19d)
Modular version: modular 0.7.4 (df7a9e8b)
jiex-liu commented 4 months ago

Mojo is not faster than numpy for now, but it will be. The point of Mojo is to be a systems-level language that provides the same performance as C/C++, if not more, without having to write convoluted C code. It is still a very young language with much to improve.

MoSafi2 commented 4 months ago

@taalhaataahir0102 NumPy (in most cases) delegates linear algebra operations to optimized libraries like BLAS and LAPACK; the same approach is taken by other languages like Julia, which also delegates its linalg operations. So the matmul in NumPy is not "just" NumPy. These optimized libraries also have different implementations depending on factors like the shape, size, and data type of the matrix, etc. https://numpy.org/doc/stable/reference/routines.linalg.html The example provided in the mojo repo is a toy example of the kinds of optimization possible in the language itself, with zero dependencies on 3rd-party libraries. The performance attained by BLAS is possible to achieve and surpass in Mojo (currently implemented in the closed-source MAX engine), and maybe it can be implemented in open source as more people get interested in the language.
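As a quick way to see which backend your NumPy build delegates to, something like this small Python snippet works (the exact output depends on how NumPy was installed: OpenBLAS, MKL, AOCL, Accelerate, ...):

import numpy as np

# Print the BLAS/LAPACK libraries this NumPy build was linked against;
# the matmul speed you measure is largely the speed of that backend.
np.show_config()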

jackos commented 3 months ago

Thanks for clarifying. Also note that the matmul used for the MAX engine is much faster than NumPy, as it's optimized for different CPUs: https://www.modular.com/blog/the-worlds-fastest-unified-matrix-multiplication. We should check whether MAX is installed and run the fully optimized version in this example to compare performance.

taalhaataahir0102 commented 3 months ago

Thanks @jackos for the reply. I was trying the MAX Graph API: I created a simple graph which takes two tensors of shape 100x100 as input and multiplies them, and executed it 1000 times. Similarly, I created two random numpy arrays and multiplied them 1000 times. Numpy still performs better (around 20-25x faster than the MAX Graph API). Here's my code:

from max.graph import Graph, TensorType, Type
from max import engine
from tensor import Tensor, TensorShape, randn
from time import now
from python import Python

def main():
    # Build a graph with two 100x100 float32 inputs and a single matmul node.
    graph = Graph(in_types=List[Type](TensorType(DType.float32, 100, 100), TensorType(DType.float32, 100, 100)))
    out = graph[0] @ graph[1]
    graph.output(out)
    graph.verify()

    # Compile the graph into an executable model.
    session = engine.InferenceSession()
    model = session.load(graph)

    var input0 = randn[DType.float32]((100, 100))
    var input1 = randn[DType.float32]((100, 100))

    # Time 1000 executions of the MAX graph.
    var start = now()
    for i in range(1000):
        var ret = model.execute("input0", input0, "input1", input1)
    var end = now()

    var execution_time_seconds: Float32 = (end - start) / 1000000000
    print("MAX GRAPH API:", execution_time_seconds)

    # Time 1000 numpy matmuls of the same size through Python interop.
    var np = Python.import_module("numpy")
    array1 = np.random.rand(100, 100)
    array2 = np.random.rand(100, 100)
    var start1 = now()
    for i in range(1000):
        var result = np.dot(array1, array2)
    var end1 = now()

    var execution_time_seconds1: Float32 = (end1 - start1) / 1000000000
    print("NUMPY:", execution_time_seconds1)

And these are the results (in seconds): [screenshot of timings showing numpy roughly 20-25x faster than the MAX Graph API]

Can you please confirm if there's something I'm doing wrong? Here are my system details:

MAX version: max 24.3.0 (9882e19d)
Mojo version: mojo 24.3.0 (9882e19d)
OS: Ubuntu 22.04.3 LTS

martinvuyk commented 3 months ago

2 tensors of shape 100,100 as input and multiply them 1000 times

Try bigger arrays; the numpy implementation (AOCL-BLAS / AOCL-LAPACK on your AMD CPU) probably has optimizations for small arrays that the Modular team has yet to add. Or it's cheating and using the rocBLAS AMD GPU implementation XD.
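As a rough way to see the size effect, a pure-Python sweep along these lines (sizes and repeat counts are arbitrary, just for illustration) shows how numpy's throughput climbs as the matrices get larger:

import time
import numpy as np

# Time numpy matmul at several sizes and report GFLOPS (2*n^3 flops per matmul).
for n in (100, 256, 512, 1024, 2048):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    reps = max(1, 2 * 10**9 // (2 * n**3))  # keep total work roughly constant
    start = time.perf_counter()
    for _ in range(reps):
        c = a @ b
    elapsed = time.perf_counter() - start
    print(f"n={n:5d}  reps={reps:5d}  {2 * n**3 * reps / elapsed / 1e9:7.1f} GFLOPS")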

Here are my system details

In my case Mojo doesn't use every thread available on my CPU, even though it has the ISA extensions Mojo expects. I raised issue #2398 but never got an answer. There are some issues on consumer-grade hardware that simply are not a priority for them, as far as I can see. If you have some kind of workstation you should try it there.

After a quick Google search it seems your Ryzen 9 5900HX also has the required ISA extensions. Btw, don't post your Linux username on a public forum.

jackos commented 3 months ago

Thanks @martinvuyk, sorry for not responding to that; I've updated the ticket.

jon-chuang commented 2 months ago

The reason for this is that they unmerged my PR, which made it 4x faster than numpy: https://github.com/modularml/mojo/pull/1280

jon-chuang commented 2 months ago

Created a PR to add the loop reordering / split-K sub-tiling with 2-3x better perf.

@taalhaataahir0102 could you please try this?

jackos commented 2 months ago

Hi @jon-chuang, fantastic: your implementation is passing tests now and is 2.5x faster than numpy on my M2! I left a couple of comments in your PR. You may also be interested in the #mojo-marathons channel on Discord; there is a competition running for the fastest matmul implementation, with swag for the winner at the end of the month.

jackos commented 2 months ago

Resolved by https://github.com/modularml/mojo/pull/3250. Thanks heaps @jon-chuang!

YichengDWu commented 2 months ago

This was closed by mistake. Still slower than numpy.

jon-chuang commented 2 months ago

If you can solve the problem of reliably benchmarking numpy from within Mojo, you get to create and close another issue.

Before that happens, it's unclear whether this issue should be open or closed.

Btw, if you really want proper perf, go use MAX. They have properly optimized kernels there.

What I contributed is just an example; it's basically child's play.

If you want, you can try to recreate BLAS performance. Just implement at least 3 levels of cache tiling, and you might be pretty close.
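To make that concrete, here is a rough sketch of the loop structure (written in Python/NumPy for readability; the block sizes are made-up placeholders, and a real kernel would replace the innermost block update with a hand-tuned register micro-kernel and packed panels):

import numpy as np

def tiled_matmul(A, B, L2_BLOCK=256, L1_BLOCK=64):
    # Two explicit levels of cache blocking for C = A @ B; the innermost block
    # update stands in for a third, register-level micro-kernel.
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i2 in range(0, n, L2_BLOCK):              # blocks sized for L2/L3
        for k2 in range(0, n, L2_BLOCK):
            for j2 in range(0, n, L2_BLOCK):
                for i1 in range(i2, min(i2 + L2_BLOCK, n), L1_BLOCK):      # blocks sized for L1
                    for k1 in range(k2, min(k2 + L2_BLOCK, n), L1_BLOCK):
                        for j1 in range(j2, min(j2 + L2_BLOCK, n), L1_BLOCK):
                            ie = min(i1 + L1_BLOCK, n)
                            ke = min(k1 + L1_BLOCK, n)
                            je = min(j1 + L1_BLOCK, n)
                            # accumulate the small block product (register micro-kernel in a real impl)
                            C[i1:ie, j1:je] += A[i1:ie, k1:ke] @ B[k1:ke, j1:je]
    return C

# quick correctness check against numpy
A = np.random.rand(300, 300)
B = np.random.rand(300, 300)
assert np.allclose(tiled_matmul(A, B), A @ B)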

jackos commented 2 months ago

This was closed by mistake. Still slower than numpy.

The change hasn't come across to the nightly branch yet, it'll be in the next nightly release.

jon-chuang commented 2 months ago

Anyway, this issue should be closed. The purpose of the example is to show plausible ways you can go about optimizing your CPU code using Mojo's syntactic sugar / composable combinators.

My contribution just shows a nice "next step" that closes the gap with vendored perf a bit and exposes concepts like loop reordering and stack-based accumulation (similar to shared-memory/register accumulation in the GPU programming model).

YichengDWu commented 2 months ago

I didn't want future readers to get the wrong impression that Mojo can simply be 2.5x faster than numpy on personal computers. Let me be conservative and say numpy (i.e. OpenBLAS) achieves 90% of peak GFLOPS; 2.5x of that would exceed the hardware's peak. I didn't need to look into what was going wrong with the numpy benchmarking, nor into the Mojo implementation, to know the number is wrong; it's just common sense.

jon-chuang commented 2 months ago

None of what you say is proven. So if you can fix the numpy benchmarking issue, please make a contribution. Common sense or not, I'll believe it when I see reliable numbers.

jon-chuang commented 2 months ago

Created an issue: https://github.com/modularml/mojo/issues/3253

YichengDWu commented 2 months ago

Here is a proof, if you want one:

$2.5 \times 0.9 = 2.25 > 1$: if numpy is already at 90% of peak GFLOPS, anything claimed to be 2.5x numpy would be running at 225% of peak, which is impossible.

You should run a pure Python script to benchmark numpy if you want to see reliable numbers to begin with.
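Something like this would do (a minimal sketch of what I mean; the 100x100 size and 1000 iterations are copied from the earlier Mojo snippet):

import time
import numpy as np

a = np.random.rand(100, 100).astype(np.float32)
b = np.random.rand(100, 100).astype(np.float32)

# warm up so one-time setup cost is not included in the measurement
for _ in range(10):
    a @ b

start = time.perf_counter()
for _ in range(1000):
    c = a @ b
elapsed = time.perf_counter() - start
print(f"numpy: {elapsed:.6f} s for 1000 matmuls of 100x100")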

jon-chuang commented 2 months ago

created a simple graph which takes in 2 tensors of shape 100,100 as input and multiply them 1000 times.

These are extremely non-standard sizes and too small to really care about. Typical matrix sizes we care about have at least 512 in one or more dimensions. I agree in principle that these should ideally be optimized in MAX, but I doubt it is a priority unless you are running quite tiny models. If you have a specific use case with sizes that show suboptimal perf, I suggest you file a report in the max repo: https://github.com/modularml/max/tree/main/examples

@YichengDWu please read the issue discussion carefully. The fact that examples/matmul.mojo is slower than numpy is a non-issue to begin with. This is like saying Andrej Karpathy's llama2.c should beat GPT-4. They serve different roles: education and production.

YichengDWu commented 2 months ago

I was simply saying the 2.5x claim is false.

I don't have an issue with your implementation being slower than numpy. I am simply observing that it is not faster.

It's more like "of course llama2.c can't beat GPT-4, so it's okay to say llama2.c can generate images"; two negatives do not make a positive.