modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[BUG] parallelize can't turn hyperthreading on #2398

Open martinvuyk opened 6 months ago

martinvuyk commented 6 months ago

Bug description

I've been trying for the last week to do the One Billion Row Challenge in Mojo. Here is the code. At first I thought the problem was my implementation, but the more I optimized, the more I realized something else might be going on.

The best Go solutions take about 6.5 s on my machine and the best Java one 3.8 s, and during execution they use all threads. Mojo takes 16.3 s without even having every part solved (only line reading and parsing), even though I've used basically the same bit-manipulation logic as the best Java version. I also have a PC with a 12th-gen i5, but it runs Windows and I haven't gotten around to partitioning it to try the code there (the Docker image didn't work), and it's too much of a hassle at this point.
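For context, this is the kind of bit trick I mean: the SWAR "find a byte in a word" pattern the fastest Java entries use to locate the `;` separator. A minimal sketch (the function name is mine, and I'm assuming the `math.bit.cttz` of this Mojo version):

```mojo
from math.bit import cttz  # count trailing zeros

# SWAR trick: locate the first `;` among 8 bytes packed into a UInt64.
fn find_semicolon(word: UInt64) -> Int:
    alias semis: UInt64 = 0x3B3B3B3B3B3B3B3B  # `;` (0x3B) in every byte
    var x = word ^ semis  # a byte becomes 0x00 exactly where `;` matched
    # Classic "has zero byte" mask: the high bit ends up set in the
    # first byte of `t` that was zero in `x`.
    var t = (x - 0x0101010101010101) & ~x & 0x8080808080808080
    if t == 0:
        return -1  # no `;` in these 8 bytes
    return int(cttz(t) >> 3)  # byte index of the first match
```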

The parallelize function uses only 4 of the 8 threads my i7-7700HQ has; I just watched it with the system monitor. The chip has 4 physical cores, so maybe it's something to do with that?

I don't know if you plan on supporting old potatoes with every advanced feature of the language, but parallelize seems pretty useful for reviving old server/workstation hardware.
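For what it's worth, here is a minimal sketch of what I'd expect to control the worker count (assuming the `sys.info` core-count helpers and the `parallelize(num_work_items, num_workers)` overload of this Mojo version):

```mojo
from sys.info import num_physical_cores, num_logical_cores
from algorithm import parallelize

fn main():
    print(num_physical_cores())  # 4 on this i7-7700HQ
    print(num_logical_cores())   # 8 with hyperthreading

    @parameter
    fn work(i: Int):
        print("task", i)

    # Ask for one worker per logical core; in practice I still only
    # see 4 of the 8 hardware threads busy in the system monitor.
    parallelize[work](64, num_logical_cores())
```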

Steps to reproduce

Data about the chip from Intel's website: the i7-7700HQ has 4 cores and 8 threads (the lscpu output below shows the same).

System information

- What OS did you install Mojo on?
Ubuntu 22.04.4 LTS
- Provide version information for Mojo by pasting the output of `mojo -v`
mojo 24.2.1 (2f0dcf11)
- Provide Modular CLI version by pasting the output of `modular -v`
modular 0.7.2 (d0adc668)
martinvuyk commented 6 months ago

I also tried executing the MatMul example with 500 matrices, and it only uses 4 threads.

Output of lscpu for more info:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            9
    CPU max MHz:         3800,0000
    CPU min MHz:         800,0000
    BogoMIPS:            5599.85
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp vnmi md_clear flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    1 MiB (4 instances)
  L3:                    6 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
martinvuyk commented 6 months ago

I compiled the matmul example and ran objdump -d on it, but I definitely can't read pthread assembly.

Here is the code, showing how I removed vectorization; the output of objdump -d is attached: objdump_matmul.txt

Result of running `getconf GNU_LIBPTHREAD_VERSION`: NPTL 2.35

# Copyright (c) 2023, Modular Inc. All rights reserved.
#
# Licensed under the Apache License v2.0 with LLVM Exceptions:
# https://llvm.org/LICENSE.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ===----------------------------------------------------------------------=== #

# This sample demonstrates how various systems optimizations can be applied to a
# naive matmul implementation in Mojo to gain significant performance speedups

from random import rand

import benchmark
from algorithm import parallelize
from memory import memset_zero
from sys.info import simdwidthof  # needed by the `nelts` alias below

alias M = 512  # rows of A and C
alias N = 4096  # cols of B and C
alias K = 512  # cols of A and rows of B
alias type = DType.float32

# simdwidthof = the number of `type` elements that fit in a single SIMD register
# 2x multiplier will use multiple SIMD registers in parallel where possible
alias nelts = simdwidthof[type]() * 2
alias tile_n = 64  # N must be a multiple of this
alias tile_k = 4  # K must be a multiple of this

struct Matrix[rows: Int, cols: Int]:
    var data: DTypePointer[type]

    # Initialize, zeroing all values
    fn __init__(inout self):
        self.data = DTypePointer[type].alloc(rows * cols)
        memset_zero(self.data, rows * cols)

    # Initialize taking a pointer, don't set any elements
    fn __init__(inout self, data: DTypePointer[type]):
        self.data = data

    # Initialize with random values
    @staticmethod
    fn rand() -> Self:
        var data = DTypePointer[type].alloc(rows * cols)
        rand(data, rows * cols)
        return Self(data)

    fn __getitem__(self, y: Int, x: Int) -> Scalar[type]:
        return self.load[1](y, x)

    fn __setitem__(inout self, y: Int, x: Int, val: Scalar[type]):
        self.store[1](y, x, val)

    fn load[nelts: Int](self, y: Int, x: Int) -> SIMD[type, nelts]:
        return self.data.load[width=nelts](y * self.cols + x)

    fn store[nelts: Int](self, y: Int, x: Int, val: SIMD[type, nelts]):
        return self.data.store[width=nelts](y * self.cols + x, val)

# Parallelize the code by using the builtin parallelize function
fn matmul_parallelized(inout C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

    parallelize[calc_row](C.rows, C.rows)

fn main() raises:
    var A = Matrix[M, K].rand()
    var B = Matrix[K, N].rand()
    var C = Matrix[M, N]()

    matmul_parallelized(C, A, B)

    A.data.free()
    B.data.free()
    C.data.free()
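For a run that's long enough to watch in the system monitor, main can be swapped for a small timing harness in the style of the original example (a sketch, assuming `benchmark.run` as in this Mojo version):

```mojo
fn main() raises:
    var A = Matrix[M, K].rand()
    var B = Matrix[K, N].rand()
    var C = Matrix[M, N]()

    # Repeat the multiplication so the process runs for a while.
    @parameter
    fn bench():
        matmul_parallelized(C, A, B)

    var report = benchmark.run[bench]()
    print("mean runtime (s):", report.mean())

    A.data.free()
    B.data.free()
    C.data.free()
```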
jackos commented 5 months ago

Hi @martinvuyk, for the kernels team, hyperthreading reduces performance across the majority of their kernels. There are cases where hyperthreading improves pipelining and overall performance, e.g. Mandelbrot, where work items have varying execution times. However, due to API changes there isn't currently a way to turn hyperthreading on. This will be resolved soon, though, and I'll update this ticket with progress.
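To illustrate the tradeoff: when every work item costs about the same, two hyperthreads mostly contend for one core's execution ports, but when item costs vary wildly, extra workers help keep every core fed. A toy sketch (illustrative only, using the public parallelize API):

```mojo
from algorithm import parallelize

fn main():
    # Work items with wildly uneven cost, like rows of a Mandelbrot
    # render where escape times differ across the image.
    @parameter
    fn calc_row(i: Int):
        var acc = 0
        for _ in range((i % 7) * 100000):
            acc += 1
        if acc < 0:  # never true; keeps the loop from being elided
            print(acc)

    # More workers than physical cores can pay off here; for uniform
    # compute-bound kernels it usually does not.
    parallelize[calc_row](256, 8)
```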