laplasz opened this issue 2 years ago
We've been looking for a way to reliably replicate this issue for a while. Would you mind checking whether you can replicate it on a clean install in a container? What Python are you using?
Referencing #353 because I suspect it is the same issue
That is a clean install, since this is an image dedicated to this service. Here are the relevant Dockerfile commands:
RUN zypper refresh \
&& zypper install \
--auto-agree-with-licenses \
--no-confirm \
--no-recommends \
iproute2 \
iptables \
iputils \
less \
python39-3.9.4-2.4 \
python39-pip \
python39-devel \
conntrack-tools \
ipset \
libcap-progs \
ethtool \
ulogd \
gcc-c++ \
git \
make \
gcc
RUN \
http_proxy=${http_proxy} \
https_proxy=${https_proxy} \
no_proxy=${no_proxy} \
pip3.9 install -U \
--no-cache-dir \
--index-url ${PIP_INDEX} \
kubernetes==18.20.0 \
Flask==2.0.1 \
flask-restful==0.3.9 \
pyroute2==0.6.5 \
tabulate==0.8.9 \
mmh3==3.0.0 \
prometheus_client==0.11.0 \
kafka-python==2.0.2 \
protobuf==3.17.3 \
grpcio==1.39.0 \
wheel \
pynvml \
yappi
RUN \
http_proxy=${http_proxy} \
https_proxy=${https_proxy} \
no_proxy=${no_proxy} \
pip3.9 install -U \
--no-cache-dir \
--index-url ${PIP_INDEX} \
git+https://github.com/plasma-umass/scalene
If you add some debug lines to the code, I can create the logs for you.
@sternj any idea what I can try in order to find the root cause?
What is the base container you are using?
Unable to replicate on a VM running OpenSUSE 20220308 (Tumbleweed).
It was SUSE Enterprise, but I could easily reproduce it using Ubuntu 20.04, so I guess this is not OS related but rather Python code related. Let me share the main method of the Python code, followed by the Ubuntu Dockerfile I used:
if __name__ == "__main__":
    logger.initializeLogger()
    s = Selector()
    s.ready = True
    s.start()
FROM ubuntu:20.04 as image
COPY --from=pre-release /unsolicited /opt/lb/unsolicited
COPY --from=pre-release /bird-2.0.8/bird /bird-2.0.8/birdc /bird-2.0.8/birdcl /usr/local/sbin/
COPY --from=pre-release /bird-2.0.8/doc/bird.conf.example /usr/local/etc/bird.conf
RUN \
mkdir -p /usr/local/var/run \
# Restrict access to bird conf file
&& chmod 644 /usr/local/etc/bird.conf
ARG REVISION
ARG PIP_INDEX
LABEL GIT_REVISION=${REVISION}
ENV DEBIAN_FRONTEND="noninteractive" TZ="Europe/London"
RUN \
apt-get update && \
apt-get install -y software-properties-common
RUN \
add-apt-repository universe && \
apt-get update && \
apt-get install -y \
iproute2 \
iptables \
iputils-ping \
less \
python3.9 \
python3.9-dev \
python3-pip \
conntrack \
ipset \
net-tools \
libcap-dev \
ethtool \
ulogd2 \
git \
make \
gcc \
gawk \
arping
RUN update-alternatives --log /dev/null --install /usr/bin/python3 python3 /usr/bin/python3.9 1
RUN \
http_proxy=${http_proxy} \
https_proxy=${https_proxy} \
no_proxy=${no_proxy} \
pip3 install -U \
--no-cache-dir \
--index-url $PIP_INDEX \
kubernetes==18.20.0 \
Flask==2.0.1 \
flask-restful==0.3.9 \
pyroute2==0.6.5 \
tabulate==0.8.9 \
mmh3==3.0.0 \
prometheus_client==0.11.0 \
kafka-python==2.0.2 \
protobuf==3.17.3 \
grpcio==1.39.0 \
wheel \
pynvml
RUN \
http_proxy=${http_proxy} \
https_proxy=${https_proxy} \
no_proxy=${no_proxy} \
pip3 install -U \
--no-cache-dir \
--index-url $PIP_INDEX \
git+https://github.com/plasma-umass/scalene
ARG USER="adc"
ARG HOME="/home/${USER}/"
RUN \
groupadd --gid "249745" ${USER} \
&& useradd --comment "Container service user" \
--create-home --home-dir "${HOME}" \
--shell /bin/false --uid 249745 --gid 249745 ${USER} \
&& chown -R :root "${HOME}" && chmod -R g+s=u "${HOME}"
USER 249745:249745
WORKDIR "${HOME}"
CMD ["sleep", "infinity"]
What library is Selector from?
Honestly, what would be really useful (given that this is likely proprietary code) is a minimal example of a script and a Dockerfile that exhibits this behavior; if I can replicate it quickly, then I can get to the bottom of it.
As for what information would be useful, it's... complicated. This fork bomb could be coming from several places in the stack, so unfortunately, debug lines on your end would not be helpful unless I already knew what I was looking for. A minimal example is the best possible thing, especially because this is a really severe issue.
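For instance, assuming Selector ultimately boils down to a threading.Thread subclass, a repro as small as the following (file name and workload invented purely for illustration), run under scalene the same way as your real entrypoint, would be plenty:

# repro.py - hypothetical minimal example, not the real Selector code
import threading
import time

class Selector(threading.Thread):
    """Stand-in worker: stays busy so the profiler has something to sample."""

    def __init__(self):
        super().__init__()
        self.ready = False

    def run(self):
        results = []
        while self.ready:
            # Burn some CPU and allocate a little memory on every iteration.
            results.append(sum(i * i for i in range(10_000)))
            if len(results) > 1_000:
                results.clear()
            time.sleep(0.01)

if __name__ == "__main__":
    s = Selector()
    s.ready = True
    s.start()
    s.join()

If running scalene repro.py inside your image still shows the runaway subprocesses, that script plus the Dockerfile is everything I need.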
Describe the bug
scalene is started as the entrypoint of a container:
/bin/bash -c cd /opt/lb/; scalene backendselector.py
After a few minutes the container restarts with an out-of-memory event, even though the memory limit of the container is 10 GB. Checking the processes inside the container makes it obvious that the huge memory allocation happens because scalene starts endless subprocesses.
Expected behavior
Fewer processes, I guess.
Additional context
The Python script uses multiple threads.
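For reference, the runaway process count can be confirmed from inside the container with a quick check along these lines (a rough sketch that simply counts processes whose command line mentions scalene):

# count_scalene_procs.py - sketch: count processes whose cmdline mentions "scalene"
import os

count = 0
for pid in os.listdir("/proc"):
    if not pid.isdigit():
        continue
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            cmdline = f.read().replace(b"\0", b" ")
    except OSError:
        continue  # the process may have exited in the meantime
    if b"scalene" in cmdline:
        count += 1

print(f"scalene-related processes: {count}")

Running it repeatedly should show the count climbing until the container hits its memory limit.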