openzim / python-libzim

Libzim binding for Python: read/write ZIM files in Python
https://pypi.org/project/libzim/
GNU General Public License v3.0

Memory overhead? #111

Closed: rgaudin closed this issue 3 years ago

rgaudin commented 3 years ago

I just realized that the Creator's starting step has a surprisingly large memory overhead, and I'm wondering whether this is to be expected or whether something's wrong with the wrapper.

Here's the simplest use case:

import pathlib

from libzim.writer import Creator

with Creator(filename=pathlib.Path("test_ram_plz.zim")) as creator:
    pass

PYTHONPATH=src time -l python test_ram_plz.py
Resolve redirect
set index
        0.37 real         0.18 user         0.18 sys
           690446336  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              170546  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   1  messages sent
                   1  messages received
                   0  signals received
                  11  voluntary context switches
                 619  involuntary context switches
           958867008  instructions retired
          1364227186  cycles elapsed
           686837760  peak memory footprint

This seems to have used 658.46 MiB of RAM…
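
For reference, the MiB figure comes from dividing the "maximum resident set size" reported by `time -l` (in bytes on macOS) by 1024 * 1024. A minimal sketch of that arithmetic; the helper name `bytes_to_mib` is only for illustration and is not part of libzim:

```python
def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


# "maximum resident set size" from the `time -l` output above, in bytes
max_rss_bytes = 690446336
print(f"{bytes_to_mib(max_rss_bytes):.2f} MiB")  # prints "658.46 MiB"
```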

When just instantiating the Creator, I get a mere 8 MiB (Python itself consumes about 6.5 MiB), so it seems that startZimCreation() is responsible for this.
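
One way to narrow down which step is responsible is to read the process's peak RSS from inside Python before and after that step. The following is only a sketch using the standard-library `resource` module (Unix only); the helper name `peak_rss_mib` is hypothetical, and the `bytearray` allocation is a stand-in for the step under test (for example, entering the Creator context), not anything libzim does:

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mib()
buffer = bytearray(64 * 1024 * 1024)  # stand-in for the step being measured
after = peak_rss_mib()
print(f"peak RSS grew by about {after - before:.1f} MiB")
```

Because `ru_maxrss` is a high-water mark, this only shows increases: a step that allocates less than an earlier peak will not register.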

Tested on macOS.

rgaudin commented 3 years ago

Re-running this on the latest libzim gives a more realistic measurement: 98.46 MiB

/usr/bin/time -l python test_ram_plz.py
Resolve redirect
set index
        0.10 real         0.06 user         0.03 sys
           103239680  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               27173  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   1  messages sent
                   1  messages received
                   0  signals received
                  11  voluntary context switches
                 258  involuntary context switches
           364553572  instructions retired
           355994042  cycles elapsed
            99545088  peak memory footprint

I'd put this down to an in-dev libzim bug that got fixed along the way, and close this.

FYI, tweaking the indexing, the number of workers, or the cluster size has no noticeable impact.