Closed — rgrzeszi closed this issue 3 years ago.
Thanks for the report and the deadlock PR fix in #50.
Regarding the process build-up, can you tell whether:
(1) the chromium processes accumulate within a single long-running Python process, or
(2) they are left behind after the Python processes that launched them have exited?
The memory leak fix in https://github.com/plotly/Kaleido/pull/43 involves periodically reloading the headless chromium tab that kaleido uses. If you're seeing (1) above, it would be helpful to know whether this makes any difference for you.
You can install the alpha build of kaleido that has this fix with:

pip install https://github.com/plotly/Kaleido/releases/download/v0.1.0a2/kaleido-0.1.0a2-py2.py3-none-manylinux1_x86_64.whl
If (2), do you know whether the Python process that's driving kaleido always exits cleanly (without crashing)? The chromium process should be shut down when Python exits and calls the `__del__` method on the base scope, but something might be preventing that method from being called.
Thanks!
Hello Jon,
It's (1): a single Python instance that runs a data analysis and generates quite a lot of plots.
A method in a plotting class is called multiple times, and the Kaleido scope is created inside it as shown above. With every write a new instance is spawned, but it seems at least some of them do not terminate correctly. My understanding was that the scope would be created within the method, and on leaving the method `__del__` would implicitly be called, which would then call `_shutdown_kaleido` and avoid the deadlock issue. However, it seems this is not the case. I would have to run more experiments on this.
I cannot pinpoint it to a single call, but it seems that simpler plots may not cause this issue (e.g. a simple pie chart). I do visualize more complex things, like heatmaps on larger background images (3-4 megapixels). I assume that the process does not terminate correctly in these cases.
OK, this actually makes sense. The `__del__` method isn't guaranteed to be called when the method exits (https://docs.python.org/3/reference/datamodel.html?highlight=__del__#object.__del__), so it's not too surprising that the chromium subprocesses build up with this workflow. It's possible that the thread watching standard error is preventing the reference count of the scope from dropping to zero, but that's just a guess.
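To make that reference-count guess concrete, here is a runnable toy (no kaleido involved; `Scope`, `_watch`, and `make_plot` are invented names for illustration): as long as a watcher thread holds a reference to the scope, `__del__` does not fire when the creating method returns.

```python
import threading
import time

class Scope:
    """Toy stand-in for a kaleido scope whose __del__ would shut down chromium."""
    deleted = False

    def __init__(self):
        # Watcher thread (like the one reading chromium's stderr); passing
        # self as an argument keeps the scope referenced while the thread runs.
        self._watcher = threading.Thread(target=Scope._watch, args=(self,), daemon=True)
        self._watcher.start()

    @staticmethod
    def _watch(scope):
        time.sleep(5)  # stands in for blocking on the subprocess's stderr

    def __del__(self):
        Scope.deleted = True  # the real scope would shut the subprocess down here

def make_plot():
    scope = Scope()  # the local reference is dropped when the function returns
    # ... export an image with the scope ...

make_plot()
print(Scope.deleted)  # False: the watcher thread still references the scope
```

The scope's refcount never reaches zero while the thread is alive, so each call to `make_plot` would leave another chromium subprocess behind.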
The workflow that the Kaleido scope is designed for, to this point, is to reuse a single scope repeatedly so that the chromium startup time is only required the first time. Is this architecture possible for you?
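A minimal sketch of that single-scope reuse, with a stand-in `PlotlyScope` class (invented here so the pattern is runnable without kaleido; the real class is `kaleido.scopes.plotly.PlotlyScope`):

```python
import functools

class PlotlyScope:
    """Stand-in for kaleido.scopes.plotly.PlotlyScope, just to show the pattern."""
    launches = 0

    def __init__(self):
        PlotlyScope.launches += 1  # the real scope launches chromium here

    def transform(self, fig, format="png"):
        return b"image-bytes"      # the real scope returns the encoded image

@functools.lru_cache(maxsize=None)
def get_scope():
    # Created once on first use; every later call returns the same scope.
    return PlotlyScope()

def write_image(fig):
    return get_scope().transform(fig)

for fig in ("fig1", "fig2", "fig3"):
    write_image(fig)
print(PlotlyScope.launches)  # 1: chromium would start only once for all exports
```

All exports share one scope, so the chromium startup cost is paid once and no subprocesses accumulate.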
The alternative is to make sure that the chromium subprocess shuts down when you are finished exporting images with the scope. We should probably create a public `shutdown` method and document that it should be called to guarantee that the chromium subprocess is shut down. We could also make the Kaleido scope closable, so that you could use it in a context manager like this:
```python
with PlotlyScope(...) as scope:
    # Chromium subprocess launched
    scope.transform()

# Chromium subprocess shut down
```
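One possible shape for such a closable scope (a sketch only; `ClosableScope`, `shutdown`, and `running` are hypothetical names, not existing kaleido API):

```python
class ClosableScope:
    """Sketch of the proposed closable scope; shutdown() is the hypothetical public method."""

    def __init__(self):
        self.running = True          # real scope: launch the chromium subprocess

    def transform(self, fig):
        if not self.running:
            raise RuntimeError("scope already shut down")
        return b"image-bytes"

    def shutdown(self):
        self.running = False         # real scope: terminate the subprocess

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()              # runs even if transform() raised
        return False                 # do not swallow exceptions

with ClosableScope() as scope:
    scope.transform("fig")
print(scope.running)  # False: subprocess shut down on exiting the block
```

Unlike `__del__`, the `__exit__` call is guaranteed to run when the `with` block is left, even on an exception, so the subprocess cannot be leaked by a lingering reference.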
I believe I tried creating a single scope and it had the same issue, but I will confirm this.
You were absolutely right: `__del__` was not being called, and the subprocesses built up because of this. Strangely enough, this does not happen on all machines. I had to do some rewriting to really create a single scope, but that seems to solve the issue, and it also avoids the infinite loop in the error handler (as I no longer call `_shutdown_kaleido` manually) - thanks!
Thanks for reporting back @rgrzeszi. Glad it's working for you now! I'll still get your PR in, and consider where to document this potential pitfall.
Alright to close this @rgrzeszi?
Hi guys,
we are running plotly and kaleido and generating a large number of plots (usually rendered as SVG or PNG) from potentially large images. I observed quite a large number of processes that are not being stopped properly (see below), up to the point where no more processes can be forked and the whole program crashes.
Following the workaround here: https://github.com/plotly/Kaleido/issues/42
I implemented a call that forcefully shuts down kaleido:
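Roughly, the call looks like this (the exact snippet isn't shown, so this is a sketch with a stand-in class; the real one is `kaleido.scopes.plotly.PlotlyScope`, and `_shutdown_kaleido` is a private method, so this is a fragile workaround rather than a supported API):

```python
class PlotlyScope:
    """Stand-in for kaleido.scopes.plotly.PlotlyScope, to illustrate the workaround."""

    def __init__(self):
        self._proc = object()        # real scope: the chromium subprocess handle

    def transform(self, fig, format="png"):
        return b"image-bytes"        # real scope: encoded image from chromium

    def _shutdown_kaleido(self):
        self._proc = None            # real scope: terminate the subprocess

scope = PlotlyScope()
data = scope.transform("fig")
scope._shutdown_kaleido()            # force the subprocess to go away after each export
print(scope._proc is None)  # True
```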
This partially solved the issue at hand. However, I can now observe the following behavior: depending on when I shut down kaleido, I sooner or later run into a deadlock in the collection of standard error:
My current workaround is to break out of the loop if the process is None.
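For illustration, a toy reconstruction of the stderr-collection loop with that `is None` break (names like `collect_stderr` and `FakeProc` are invented here; the real loop lives inside kaleido's scope, so this only mirrors its structure loosely):

```python
class FakeProc:
    """Stand-in for the chromium subprocess handle."""

    def __init__(self, lines):
        self._lines = iter(lines)

    def readline(self):
        return next(self._lines, b"")  # b"" once the stream is exhausted

def collect_stderr(get_proc, buf):
    while True:
        proc = get_proc()
        if proc is None:   # workaround: the process was forcefully shut
            break          # down, so stop instead of blocking forever
        line = proc.readline()
        if not line:
            break
        buf.append(line)

proc = FakeProc([b"warning\n"])
buf = []
collect_stderr(lambda: proc, buf)   # normal case: drain the stream
print(buf)  # [b'warning\n']

buf2 = []
collect_stderr(lambda: None, buf2)  # after a forceful shutdown: exits immediately
print(buf2)  # []
```

Without the `is None` check, a forceful shutdown that clears the process handle would leave the watcher thread blocked on a read that can never complete, which matches the deadlock described above.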
Any help / feedback would be appreciated.