[BUG] Mojo is over 10x slower than Python for reading files using ```file.read(size)``` and ```file.read_bytes(size)```

johnswede commented 1 month ago

Bug description

I was attempting to parse a large file in Mojo when I discovered it's 10x slower than using Python.

These were the speed differences when reading 4-byte chunks 1 million times.

$ python read_test.py
total runtime:  0.0748566 seconds
time per iter:  74.8566 nano secs

$ magic run mojo read_test.mojo
total runtime:  1.178206613 seconds
time per iter:  1178 nano secs

The culprit appears to be the external calls in read and read_bytes:

https://github.com/modularml/mojo/blob/b5391bae169c63d36e75f2ffa33c6dc6d8242f99/stdlib/src/builtin/file.mojo#L339-L345

Do you have any idea why the external calls are so slow?

Steps to reproduce

Profiling code

Python:

import time

def read_file(filename: str, n: int):
    with open(filename, "rb") as file:
        start_time = time.monotonic_ns()
        for _ in range(n):
            _ = file.read(4)
        end_time = time.monotonic_ns()
        runtime = end_time - start_time
        runtime_secs = runtime / (10**9)

        print("total runtime: ", runtime_secs, "seconds")
        print("time per iter: ", (runtime / n), "nano secs")

if __name__ == "__main__":
    filename = "path_to_file"
    read_file(filename, 1_000_000)

Mojo:

import time

fn read_file(filename: String, n: Int) raises:
    with open(filename, "rb") as file:
        var start_time: UInt64 = time.now()
        for _ in range(n):
            _ = file.read_bytes(4)
        var end_time: UInt64 = time.now()
        var runtime: UInt64 = end_time - start_time
        var runtime_secs: Float64 = runtime.cast[DType.float64]() / (10**9)

        print("total runtime: ", runtime_secs, "seconds")
        print("time per iter: ", (runtime / n), "nano secs")

def main():
    filename = "path_to_file"
    read_file(filename, 1_000_000)

System information

Windows 10, WSL2
CPU: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Ubuntu 22.04.2 LTS

Python 3.10.12

mojo 24.4.0 (59977802)
magic 0.2.3

owenhilyard commented 1 month ago

I'm fairly sure this is a buffered vs non-buffered IO issue. I ran both implementations and on my x86 Linux system the python version made 1654 syscalls (counted by strace piped into wc), the mojo binary (I compiled it beforehand) was at 1000549. If I switch to reading in 4096 bytes (python's default buffer size), Mojo pulls ahead by ~20%.

Does it make sense to switch the file type to buffered IO by default?

owenhilyard commented 1 month ago

Also, @johnswede, consider using mmap + madvise for parsing large files, it saves a lot of implementation work and the extra information is often easier for the kernel to work with.

modularml / mojo