plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0

scalene starts many subprocesses, which causes an out-of-memory issue #374

Open laplasz opened 2 years ago

laplasz commented 2 years ago

Describe the bug scalene is started as the entrypoint of a container: /bin/bash -c cd /opt/lb/; scalene backendselector.py

After a few minutes the container restarts with an out-of-memory event, even though the container's memory limit is 10 GB. Checking the processes inside the container makes it obvious that the huge memory allocation happens because scalene keeps starting new subprocesses:

UID        PID  PPID  C STIME TTY          TIME CMD
adc          1     0  0 16:43 ?        00:00:00 /bin/bash -c cd /opt/lb/; scalene backendselector.py
adc          7     1  0 16:43 ?        00:00:00 /usr/bin/python3.9 /usr/local/bin/scalene backendselector.py
adc          8     7  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc          9     8  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         10     9  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         11    10  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         12    11  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         13    12  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         14    13  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         15    14  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         16    15  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         17    16  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         18    17  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         19    18  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         20    19  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         21    20  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         22    21  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         23    22  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         24    23  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         25    24  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         26    25  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         27    26  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         28    27  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         29    28  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         30    29  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         31    30  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         32    31  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         33    32  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         34    33  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         35    34  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         36    35  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         37    36  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         38    37  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         39    38  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         40    39  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         41    40  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         42    41  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         43    42  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         44    43  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         45    44  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
adc         46    45  0 16:43 ?        00:00:00 /usr/bin/python3.9 -m scalene backendselector.py
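To make the growth easier to quantify, here is a small diagnostic sketch (an editor-added suggestion, assuming psutil can be installed in the container; the sampling interval and duration are arbitrary). It walks the process tree under PID 1, the /bin/bash entrypoint shown above, and prints how many descendants exist along with their combined RSS:

import time

import psutil  # not in the image above; would need e.g. `pip3.9 install psutil`

root = psutil.Process(1)  # PID 1 is the /bin/bash entrypoint in this container
for _ in range(60):
    children = root.children(recursive=True)
    rss_total = 0
    for child in children:
        try:
            rss_total += child.memory_info().rss
        except psutil.NoSuchProcess:
            pass  # a child may exit between the tree walk and the RSS read
    print(f"{len(children)} descendant processes, {rss_total / 2**20:.1f} MiB RSS in total")
    time.sleep(1.0)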

Expected behavior fewer processes, I guess


Additional context the Python script uses multiple threads.

sternj commented 2 years ago

We've been looking for a way to stably replicate this issue for a while -- would you mind checking whether you can replicate it on a clean install in a container? What Python version are you using?

sternj commented 2 years ago

Referencing #353 because I suspect it is the same issue

laplasz commented 2 years ago

That is a clean install, since this is an image dedicated to this service. Here are the relevant Dockerfile commands:

RUN zypper refresh \
    && zypper install \
        --auto-agree-with-licenses \
        --no-confirm \
        --no-recommends \
        iproute2 \
        iptables \
        iputils \
        less \
        python39-3.9.4-2.4 \
        python39-pip \
        python39-devel \
        conntrack-tools \
        ipset \
        libcap-progs \
        ethtool \
        ulogd \
        gcc-c++ \
        git \
        make \
        gcc 

RUN \
    http_proxy=${http_proxy} \
    https_proxy=${https_proxy} \
    no_proxy=${no_proxy} \
    pip3.9 install -U \
    --no-cache-dir \
    --index-url ${PIP_INDEX} \
    kubernetes==18.20.0 \
    Flask==2.0.1 \
    flask-restful==0.3.9 \
    pyroute2==0.6.5 \
    tabulate==0.8.9 \
    mmh3==3.0.0 \
    prometheus_client==0.11.0 \
    kafka-python==2.0.2 \
    protobuf==3.17.3 \
    grpcio==1.39.0 \
    wheel \
    pynvml \
    yappi 

RUN \
    http_proxy=${http_proxy} \
    https_proxy=${https_proxy} \
    no_proxy=${no_proxy} \
    pip3.9 install -U \
    --no-cache-dir \
    --index-url ${PIP_INDEX} \
    git+https://github.com/plasma-umass/scalene

laplasz commented 2 years ago

If you add some debug lines to the code, I can create the logs for you.

laplasz commented 2 years ago

@sternj any idea what I can try to be able to find the root cause?

sternj commented 2 years ago

What is the base container you are using?

sternj commented 2 years ago

Unable to replicate on a VM running openSUSE 20220308 (Tumbleweed).

laplasz commented 2 years ago

It was SUSE Enterprise, but I could easily reproduce it using Ubuntu 20.04, so I guess this is not OS-related but rather related to the Python code. Let me share the main method of the Python code:

if __name__ == "__main__":
    logger.initializeLogger()

    s = Selector()
    s.ready = True
    s.start()
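
Since Selector and logger come from the service's own code, here is a rough, purely hypothetical stand-in (the class name is reused from the snippet above; the thread count and sleep-based work loop are invented, and logger.initializeLogger() is omitted). It only mimics the multithreaded shape mentioned in the "Additional context" and could serve as a starting point for a public reproducer:

import threading
import time


class Selector:
    # Hypothetical stand-in: a long-running service that fans work out to a
    # few background threads, mirroring the multithreaded shape of the report.
    def __init__(self, worker_count: int = 4):
        self.ready = False
        self._workers = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(worker_count)
        ]

    def _work(self) -> None:
        while self.ready:
            time.sleep(0.1)  # placeholder for real per-thread work

    def start(self) -> None:
        for worker in self._workers:
            worker.start()
        while True:  # main loop blocks forever, like a service
            time.sleep(1.0)


if __name__ == "__main__":
    s = Selector()
    s.ready = True
    s.start()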

The rest of the Dockerfile, for the Ubuntu-based image:

# Copy build content to the image
COPY --from=pre-release /unsolicited /opt/lb/unsolicited

# Bird related binaries
COPY --from=pre-release /bird-2.0.8/bird /bird-2.0.8/birdc /bird-2.0.8/birdcl /usr/local/sbin/
COPY --from=pre-release /bird-2.0.8/doc/bird.conf.example /usr/local/etc/bird.conf

RUN \
    # Bird needs this dir to be available during runtime
    mkdir -p /usr/local/var/run \
    # Restrict access to bird conf file
    && chmod 644 /usr/local/etc/bird.conf

ARG REVISION
ARG PIP_INDEX
LABEL GIT_REVISION=${REVISION}
ENV DEBIAN_FRONTEND="noninteractive" TZ="Europe/London"

RUN \
    apt-get update && \
    apt-get install -y software-properties-common

RUN \
    add-apt-repository universe && \
    apt-get update && \
    apt-get install -y \
        iproute2 \
        iptables \
        iputils-ping \
        less \
        python3.9 \
        python3.9-dev \
        python3-pip \
        conntrack \
        ipset \
        net-tools \
        libcap-dev \
        ethtool \
        ulogd2 \
        git \
        make \
        gcc \
        gawk \
        arping

RUN update-alternatives --log /dev/null --install /usr/bin/python3 python3 /usr/bin/python3.9 1

RUN \
    http_proxy=${http_proxy} \
    https_proxy=${https_proxy} \
    no_proxy=${no_proxy} \
    pip3 install -U \
    --no-cache-dir \
    --index-url $PIP_INDEX \
    kubernetes==18.20.0 \
    Flask==2.0.1 \
    flask-restful==0.3.9 \
    pyroute2==0.6.5 \
    tabulate==0.8.9 \
    mmh3==3.0.0 \
    prometheus_client==0.11.0 \
    kafka-python==2.0.2 \
    protobuf==3.17.3 \
    grpcio==1.39.0 \
    wheel \
    pynvml

RUN \
    http_proxy=${http_proxy} \
    https_proxy=${https_proxy} \
    no_proxy=${no_proxy} \
    pip3 install -U \
    --no-cache-dir \
    --index-url $PIP_INDEX \
    git+https://github.com/plasma-umass/scalene

ARG USER="adc"
ARG HOME="/home/${USER}/"

RUN \
    groupadd --gid "249745" ${USER} \
    && useradd --comment "Container service user" \
        --create-home --home-dir "${HOME}" \
        --shell /bin/false --uid 249745 --gid 249745 ${USER} \
    && chown -R :root "${HOME}" && chmod -R g+s=u "${HOME}"

USER 249745:249745
WORKDIR "${HOME}"
CMD ["sleep", "infinity"]

sternj commented 2 years ago

What library is Selector from?

sternj commented 2 years ago

Honestly, what would be really useful (given that this is likely proprietary code) is a minimal example of a script and a Dockerfile that exhibits this behavior -- if I can replicate it quickly, then I can get to the bottom of it.

As for what information would be useful, it's... complicated. This fork bomb could be coming from several places in the stack, so unfortunately, debug lines on your end would not be helpful unless I already knew what I was looking for. A minimal example is the best possible thing, especially because this is a really severe issue.
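
One possible shape for such a minimal example (hypothetical; the file name, log path, thread count, and workload are the editor's suggestion, not code from the reporter): a tiny threaded script that also logs its PID and parent PID every time the module is executed. Run it inside the container the same way as the real service, e.g. /bin/bash -c 'cd /opt/lb/; scalene repro.py'; if the target module is being re-executed by each new process, /tmp/repro_pids.log will grow in step with the ps output above.

# repro.py -- hypothetical minimal example; names and paths are illustrative.
import os
import sys
import threading
import time

# Record every (re-)execution of this module: if something keeps re-launching
# the script, /tmp/repro_pids.log grows by one line per launch.
with open("/tmp/repro_pids.log", "a") as f:
    f.write(f"pid={os.getpid()} ppid={os.getppid()} argv={sys.argv}\n")


def busy_work() -> None:
    # Placeholder workload so the profiler has something to sample.
    while True:
        sum(i * i for i in range(10_000))
        time.sleep(0.01)


if __name__ == "__main__":
    # A few daemon threads, mirroring the "multithreaded" hint in the report.
    for _ in range(4):
        threading.Thread(target=busy_work, daemon=True).start()
    time.sleep(600)  # keep the main thread alive long enough to observe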