iddohau / rust-strings

Extracting strings from binary data
MIT License
6 stars 1 fork

Huge memory usage #72

Open Krakoer opened 3 weeks ago

Krakoer commented 3 weeks ago

Hi,

While using the lib from Python, I witnessed huge memory usage (a peak of ~230 MB to extract strings from a 22 MB sample), but not when running the binary directly. I suspect there is a lot of overhead while allocating the strings; the memory usage drops once the strings are returned from the lib.

To monitor the memory usage, I used memory-profiler with a Python script that loads the data into memory, waits for a second, extracts the strings with rust-strings, waits for another second and exits.

Do you have any idea what could cause such memory usage? I'll keep investigating on my side.

iddohau commented 3 weeks ago

Hi,

Thanks for reporting this issue; it is indeed a real problem. I suspect that the conversion from Rust to Python has some overhead, but not that much. I'll try to take a look at this next week.

iddohau commented 2 weeks ago

I've created a large file using this script:

with open("large_file.bin", "wb") as f:
    for _ in range(1024 * 1024):
        f.write(b"X" * 20)
        f.write(b"\xff\xff\xff\xff")

This creates a file of around 24 MB (1024 * 1024 iterations of 24 bytes each) containing about a million short strings. I've reproduced the problem using this script:

import time

import rust_strings
from memory_profiler import profile

@profile()
def main():
    time.sleep(1)
    x = rust_strings.strings("large_file.bin")
    time.sleep(1)

if __name__ == "__main__":
    main()

The huge memory consumption reproduces:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     22.1 MiB     22.1 MiB           1   @profile()
     8                                         def main():
     9     22.1 MiB      0.0 MiB           1       time.sleep(1)
    10    206.8 MiB    184.7 MiB           1       x = rust_strings.strings("large_file.bin")
    11    206.8 MiB      0.0 MiB           1       time.sleep(1)

I've tried to debug it, but I don't think there is a bug in the library itself: the result list contains about a million items, and a million small Python strings consume far more memory than one big string.
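To illustrate the point (this is only an illustration of CPython string overhead, not the library's internals): each small str object carries its own header, so a million of them cost several times more than one contiguous string of the same total length.

```python
import sys

n = 1_000_000
one_big = "X" * (20 * n)                    # one object, ~20 MB of character data
many_small = [f"{i:020d}" for i in range(n)]  # a million distinct 20-char strings

big_size = sys.getsizeof(one_big)
small_total = sum(sys.getsizeof(s) for s in many_small)
small_total += sys.getsizeof(many_small)    # plus the list's own pointer array

print(f"one big string:        {big_size / 2**20:.1f} MiB")
print(f"million small strings: {small_total / 2**20:.1f} MiB")
```

On CPython, each short ASCII string costs roughly 49 bytes of header on top of its data, which by itself accounts for tens of MiB here.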

Krakoer commented 2 weeks ago

Indeed, the issue doesn't show up when passing a file path to strings, but it does when using the bytes input option:

import time

import rust_strings
from memory_profiler import profile

@profile()
def main():
    with open("large_file.bin", 'rb') as f:
        data = f.read()
    time.sleep(1)
    x = rust_strings.strings(bytes=data)
    time.sleep(1)

if __name__ == "__main__":
    main()

Gives this profile: prof (the black line is my modified code)
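Since the file-path input doesn't exhibit the problem, a possible workaround until the bytes path is optimized is to spill the bytes to a temporary file and pass the path instead. A minimal sketch (the helper name is hypothetical; it only assumes the extraction function accepts a file path as its first argument, as in the scripts above):

```python
import os
import tempfile

def strings_via_tempfile(extract, data: bytes):
    """Write `data` to a temp file and run `extract` on its path, avoiding
    the extra in-memory copies seen with the bytes= input. `extract` is the
    extraction function, e.g. rust_strings.strings (assumed to take a path)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        return extract(path)
    finally:
        os.unlink(path)  # clean up the temp file even if extraction fails
```

Usage would then be `x = strings_via_tempfile(rust_strings.strings, data)` in place of `rust_strings.strings(bytes=data)`.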