Open martinvuyk opened 6 months ago

Bug description

I've been trying for the last week to do the One Billion Row Challenge in Mojo. Here is the code. I thought the problem was my implementation, but the more I optimized, the more it looked like something deeper.

On my machine the best Go solutions take about 6.5 s and the best Java one 3.8 s, and both use all threads during execution. Mojo takes 16.3 s without even having every part solved (only line reading and parsing), and I used basically the same bit-manipulation logic as the best Java version. I have a PC with a 12th-gen i5, but it runs Windows and I haven't gotten around to partitioning it to try the code there (the Docker image didn't work), and it's too much of a hassle at this point.

The parallelize function uses only 4 of the 8 threads that my i7-7700HQ has; I just watched it in system-monitor. The chip has 4 physical cores, so maybe it has something to do with that.

I don't know if you plan on supporting old potatoes with every advanced feature of the language, but parallelize seems pretty useful for reviving old server/workstation hardware.

Steps to reproduce

Run the code linked above and watch thread usage in a system monitor.

System information

Data about the chip from Intel's website:

I also tried executing the MatMul example with 500 matrices, and it only uses 4 threads.
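In case it helps triage, here is a minimal sketch for checking what the runtime itself reports for the machine. It assumes num_physical_cores and num_logical_cores are exposed by sys.info in the current build; if parallelize defaults to one worker per physical core, that would match the 4 busy threads above.

from sys.info import num_physical_cores, num_logical_cores

fn main():
    # On an i7-7700HQ this should print 4 physical and 8 logical cores.
    print("physical cores:", num_physical_cores())
    print("logical cores:", num_logical_cores())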
Output of lscpu for more info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU max MHz: 3800.0000
CPU min MHz: 800.0000
BogoMIPS: 5599.85
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
I compiled the matmul example and ran objdump -d on the binary, but I definitely can't read pthread assembly. Below is the code showing how I removed vectorization, and attached is the output of objdump -d:
objdump_matmul.txt
Result of running getconf GNU_LIBPTHREAD_VERSION: NPTL 2.35
# Copyright (c) 2023, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #
# This sample demonstrates how various systems optimizations can be applied to a
# naive matmul implementation in Mojo to gain significant performance speedups

from random import rand
import benchmark
from algorithm import parallelize
from memory import memset_zero
from sys.info import simdwidthof

alias M = 512  # rows of A and C
alias N = 4096  # cols of B and C
alias K = 512  # cols of A and rows of B
alias type = DType.float32

# simdwidthof = number of `type` elements that fit into a single SIMD register
# a 2x multiplier will use multiple SIMD registers in parallel where possible
alias nelts = simdwidthof[type]() * 2
alias tile_n = 64  # N must be a multiple of this
alias tile_k = 4  # K must be a multiple of this


struct Matrix[rows: Int, cols: Int]:
    var data: DTypePointer[type]

    # Initialize zeroing all values
    fn __init__(inout self):
        self.data = DTypePointer[type].alloc(rows * cols)
        memset_zero(self.data, rows * cols)

    # Initialize taking a pointer, don't set any elements
    fn __init__(inout self, data: DTypePointer[type]):
        self.data = data

    # Initialize with random values
    @staticmethod
    fn rand() -> Self:
        var data = DTypePointer[type].alloc(rows * cols)
        rand(data, rows * cols)
        return Self(data)

    fn __getitem__(self, y: Int, x: Int) -> Scalar[type]:
        return self.load[1](y, x)

    fn __setitem__(inout self, y: Int, x: Int, val: Scalar[type]):
        self.store[1](y, x, val)

    fn load[nelts: Int](self, y: Int, x: Int) -> SIMD[type, nelts]:
        return self.data.load[width=nelts](y * self.cols + x)

    fn store[nelts: Int](self, y: Int, x: Int, val: SIMD[type, nelts]):
        return self.data.store[width=nelts](y * self.cols + x, val)


# Parallelize the code by using the builtin parallelize function
fn matmul_parallelized(inout C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

    parallelize[calc_row](C.rows, C.rows)


fn main() raises:
    var A = Matrix[M, K].rand()
    var B = Matrix[K, N].rand()
    var C = Matrix[M, N]()
    matmul_parallelized(C, A, B)
    A.data.free()
    B.data.free()
    C.data.free()
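Note that the second argument to parallelize above is the requested number of workers (here C.rows, i.e. one worker per row). As a point of comparison, here is a hedged variant where the caller picks the worker count explicitly; matmul_with_workers is a name I made up for illustration, and whether the runtime will actually schedule more workers than physical cores is exactly the question in this issue.

from algorithm import parallelize
from sys.info import num_logical_cores

# Same kernel as matmul_parallelized, but the caller chooses how many
# workers the C.rows work items are distributed over.
fn matmul_with_workers(inout C: Matrix, A: Matrix, B: Matrix, workers: Int):
    @parameter
    fn calc_row(m: Int):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

    parallelize[calc_row](C.rows, workers)

For example, matmul_with_workers(C, A, B, num_logical_cores()) requests one worker per hardware thread (8 on this chip).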
Hi @martinvuyk, per the kernels team, hyperthreading reduces performance across the majority of their kernels. There are cases where hyperthreading improves pipelining and overall performance, e.g. with mandelbrot, where per-pixel execution speed varies. However, due to API changes there isn't currently a way to turn hyperthreading on; this will be resolved soon, and I'll update this ticket with progress.
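Once that switch lands, the claim above should be easy to measure on this kernel with the benchmark module. A rough sketch, reusing the Matrix struct and the hypothetical matmul_with_workers variant from earlier in this thread; the worker counts are assumptions to experiment with, not established guidance.

import benchmark
from sys.info import num_physical_cores, num_logical_cores

fn main() raises:
    var A = Matrix[M, K].rand()
    var B = Matrix[K, N].rand()
    var C = Matrix[M, N]()

    @parameter
    fn physical_workers():
        matmul_with_workers(C, A, B, num_physical_cores())

    @parameter
    fn logical_workers():
        matmul_with_workers(C, A, B, num_logical_cores())

    # benchmark.run times each closure and returns summary statistics.
    var phys = benchmark.run[physical_workers](max_runtime_secs=1.0)
    var logi = benchmark.run[logical_workers](max_runtime_secs=1.0)
    print("workers = physical cores, mean secs:", phys.mean())
    print("workers = logical cores, mean secs:", logi.mean())

    A.data.free()
    B.data.free()
    C.data.free()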