modularml / max

A collection of sample programs, notebooks, and tools which highlight the power of the MAX Platform
https://www.modular.com

[BUG]: Max Performance Showcase Comparison Results #122

Closed igoforth closed 3 months ago

igoforth commented 3 months ago

Bug description

Performance is not as expected. On this machine, the MAX Engine reports lower roberta throughput than TensorFlow (7.68 QPS vs 9.31 QPS, i.e. 0.82x), instead of the roughly 2.50x speedup the showcase says it normally sees on X86_64.

➜  performance-showcase git:(main) ✗ modular host-info
  Host Information
  ================

  Target Triple: x86_64-unknown-linux
  CPU: tigerlake
  CPU Features: adx, aes, avx, avx2, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512f, avx512ifma, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vp2intersect, avx512vpopcntdq, bmi, bmi2, clflushopt, clwb, cmov, crc32, cx16, cx8, evex512, f16c, fma, fsgsbase, fxsr, gfni, invpcid, kl, lzcnt, mmx, movbe, movdir64b, movdiri, pclmul, pku, popcnt, prfchw, rdpid, rdrnd, rdseed, sahf, sgx, sha, shstk, sse, sse2, sse3, sse4.1, sse4.2, ssse3, vaes, vpclmulqdq, widekl, x87, xsave, xsavec, xsaveopt, xsaves
➜  performance-showcase git:(main) ✗ python3 run.py -m roberta
Doing some one time setup. This takes 5 minutes or so, depending on the model.
Get a cup of coffee and we'll see you in a minute!

Done! [100%]

Starting inference throughput comparison

----------------------------------------System Info----------------------------------------
CPU: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
Arch: X86_64
Clock speed: 2.5000 GHz
Cores: 4

Running with TensorFlow
.......................................................................................... QPS: 9.31

Running with PyTorch
.......................................................................................... QPS: 6.32

Running with MAX Engine
Compiling model..
Done!
.......................................................................................... QPS: 7.68

====== Speedup Summary ======

MAX Engine vs TensorFlow: Oh, darn that's only 0.82x stock performance.
MAX Engine vs PyTorch: Oh, darn that's only 0.82x stock performance.

Hold on a tick... We normally see speedups of roughly 2.50x on TensorFlow for roberta on X86_64. Honestly, we would love to hear from you to learn more about the system you're running on! (https://github.com/modularml/max/issues/new/choose)
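
For reference, here is a standalone back-of-the-envelope check of the ratios printed above (not part of the showcase script), dividing the MAX Engine QPS by each framework's QPS. Against TensorFlow this reproduces the reported 0.82x, but against PyTorch it works out to roughly 1.22x, so the second 0.82x line may be computed against a different baseline than the PyTorch run.

# Standalone sanity check of the speedup summary, using the QPS values
# reported above (assumes speedup = MAX Engine QPS / framework QPS).
tf_qps = 9.31      # TensorFlow
pt_qps = 6.32      # PyTorch
max_qps = 7.68     # MAX Engine

print(f"MAX Engine vs TensorFlow: {max_qps / tf_qps:.2f}x")  # 0.82x
print(f"MAX Engine vs PyTorch:    {max_qps / pt_qps:.2f}x")  # 1.22x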

Steps to reproduce

Ubuntu clang version 19.0.0 (++20240318042139+208a9850e6a4-1~exp1~20240318042301.1564)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm-19/bin

cc (Debian 13.2.0-13) 13.2.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Python 3.11.8
I use pdm and pyenv to manage my Python environment.
PYTHONPATH=/home/user/.local/share/pdm/venv/lib/python3.11/site-packages/pdm/pep582
PWD=/home/user/.local/src/max/examples/performance-showcase

System information

- What OS did you install MAX on?
Linux kali 6.6.9-amd64 #1 SMP PREEMPT_DYNAMIC Kali 6.6.9-1kali1 (2024-01-08) x86_64 GNU/Linux
- Provide version information for MAX by pasting the output of `max -v`
max 24.1.1 (0ab415f7)
Modular version 24.1.1-0ab415f7-release
- Provide version information for Mojo by pasting the output of `mojo -v`
mojo 24.1.1 (0ab415f7)
- Provide Modular CLI version by pasting the output of `modular -v`
modular 0.5.2 (6b3a04fd)
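
As a side note, the three version lines above can be collected in one go; a minimal sketch (assuming the `max`, `mojo`, and `modular` binaries are on PATH, as in the report):

import subprocess

# Print the version output of each Modular CLI tool referenced above.
for cmd in (["max", "-v"], ["mojo", "-v"], ["modular", "-v"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.strip())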
ehsanmok commented 3 months ago

Thanks for reporting this issue! We're working on it. Please also take a look at this explainer.

ehsanmok commented 3 months ago

Also, please see our FAQ for this.