I'm observing only marginally better speeds on a small Linux machine running Ubuntu Jammy, consistently across multiple trials. Is there a recommended minimum number of cores needed to observe significant speedups?
Running with TensorFlow
.......................................................................................... QPS: 3.41
Running with PyTorch
0it [00:00, ?it/s]
.......................................................................................... QPS: 3.46
Running with MAX Engine
Compiling model..
Done!
.......................................................................................... QPS: 3.80
====== Speedup Summary ======
MAX Engine vs TensorFlow: That's about 1.11x faster.
MAX Engine vs PyTorch: That's about 1.10x faster.
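For reference, the ratios in the summary can be recomputed from the printed QPS figures. This is only a sketch: it assumes the summary is the MAX Engine QPS divided by the baseline QPS, which the output suggests but the showcase script itself is not quoted here to confirm. The helper name `speedup` is made up for illustration.

```python
def speedup(candidate_qps: float, baseline_qps: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_qps <= 0:
        raise ValueError("baseline_qps must be positive")
    return candidate_qps / baseline_qps


# QPS values copied from the run above (already rounded to two decimals).
qps = {"TensorFlow": 3.41, "PyTorch": 3.46, "MAX Engine": 3.80}

for baseline in ("TensorFlow", "PyTorch"):
    ratio = speedup(qps["MAX Engine"], qps[baseline])
    print(f"MAX Engine vs {baseline}: about {ratio:.2f}x")
```

Because the printed QPS values are rounded, a ratio recomputed this way can differ from the tool's own summary in the last digit.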
~/Sandbox/max/examples/performance-showcase$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7B12
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
Stepping: 0
BogoMIPS: 4499.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save umip rdpid
Virtualization features:
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 32 KiB (1 instance)
L1i: 32 KiB (1 instance)
L2: 512 KiB (1 instance)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0,1
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
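One thing worth spelling out from the `lscpu` output above: it lists 2 CPUs, but with `Thread(s) per core: 2`, `Core(s) per socket: 1` and `Socket(s): 1`, which works out to one physical core exposed as two hardware threads. A rough sketch of that arithmetic, parsing only the fields quoted above (the function names `parse_lscpu`, `physical_cores` and `logical_cpus` are hypothetical helpers, not part of any tool):

```python
def parse_lscpu(text: str) -> dict[str, str]:
    """Split 'Key: value' lines from lscpu output into a dictionary."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' shape
            fields[key.strip()] = value.strip()
    return fields


def physical_cores(fields: dict[str, str]) -> int:
    """Physical cores = sockets x cores per socket."""
    return int(fields["Socket(s)"]) * int(fields["Core(s) per socket"])


def logical_cpus(fields: dict[str, str]) -> int:
    """Logical CPUs = physical cores x hardware threads per core."""
    return physical_cores(fields) * int(fields["Thread(s) per core"])


# The relevant lines, copied from the lscpu output above.
sample = """\
CPU(s): 2
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
"""

fields = parse_lscpu(sample)
print(f"physical cores: {physical_cores(fields)}, logical CPUs: {logical_cpus(fields)}")
```

So the "Cores: 2" in the benchmark's System Info header appears to count logical CPUs rather than physical cores on this VM.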
Steps to reproduce
Include relevant code snippet or link to code that did not work as expected.
If applicable, add screenshots to help explain the problem.
Include anything else that might help us debug the issue.
System information
- What OS did you install MAX on?
  - Ubuntu Jammy
- Provide version information for MAX by pasting the output of `max -v`
  - max 24.2.0 (c2427bc5)
  - Modular version 24.2.0-c2427bc5-release
- Provide version information for Mojo by pasting the output of `mojo -v`
  - mojo 24.2.0 (c2427bc5)
- Provide Modular CLI version by pasting the output of `modular -v`
  - modular 0.6.0 (04c05243)
Bug description
---------------------------------------System Info----------------------------------------
CPU: AMD EPYC 7B12
Arch: X86_64
Clock speed: 2.2500 GHz
Cores: 2
Running with TensorFlow
.......................................................................................... QPS: 3.49
Running with PyTorch
.......................................................................................... QPS: 3.45
Running with MAX Engine
Compiling model.
Done!
.......................................................................................... QPS: 3.89
====== Speedup Summary ======
MAX Engine vs TensorFlow: That's about 1.12x faster.
MAX Engine vs PyTorch: That's about 1.13x faster.