materialscloud-org / voila-materialscloud-template

The Voila template for Materials Cloud.
Other
2 stars 3 forks source link

optimadeclient stuck in loading #76

Closed unkcpz closed 1 year ago

unkcpz commented 2 years ago

Hi @dou-du

I got following error message when I found optimadeclient is stuck in loading. There is template related information in tracklog.

2022-06-29T18:27:09.904617195Z app[web.1]: Traceback (most recent call last):                                                                                                                                                              
2022-06-29T18:27:09.904621217Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/tornado/web.py", line 1704, in _execute                                                                                                 
2022-06-29T18:27:09.904624340Z app[web.1]:     result = await result                                                                                                                                                                       
2022-06-29T18:27:09.904627185Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/handler.py", line 225, in get
2022-06-29T18:27:09.904630068Z app[web.1]:     async for html in gen:                                                                                                                                                                      
2022-06-29T18:27:09.904632749Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/handler.py", line 142, in get_generator
2022-06-29T18:27:09.904635723Z app[web.1]:     rendered, rendering = await render_task                                                                                                                                                     
2022-06-29T18:27:09.904638655Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/futures.py", line 263, in __await__
2022-06-29T18:27:09.904641468Z app[web.1]:     return self.result()  # May raise too.                                                                                                                                                      
2022-06-29T18:27:09.904644247Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/futures.py", line 178, in result
2022-06-29T18:27:09.904647112Z app[web.1]:     raise self._exception                                                                                                                                                                       
2022-06-29T18:27:09.904649728Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/asyncio/tasks.py", line 280, in __step
2022-06-29T18:27:09.904652492Z app[web.1]:     result = coro.send(None)                                                                                                                                                                    
2022-06-29T18:27:09.904655121Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 185, in generate_content_hybrid                                                                       
2022-06-29T18:27:09.904658065Z app[web.1]:     async for html_snippet, _ in generator:                                                                                                                                                     
2022-06-29T18:27:09.904660706Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/exporter.py", line 101, in generate_from_notebook_node
2022-06-29T18:27:09.904663547Z app[web.1]:     async for output in self.template.generate_async(nb=nb_copy, resources=resources, **extra_context, static_url=self.static_url):
2022-06-29T18:27:09.904666434Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/environment.py", line 1373, in generate_async
2022-06-29T18:27:09.904669299Z app[web.1]:     yield self.environment.handle_exception()                                                                                                                                                   
2022-06-29T18:27:09.904672262Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception                                                                                   
2022-06-29T18:27:09.904675209Z app[web.1]:     raise rewrite_traceback_stack(source=source)                                                                                                                                                
2022-06-29T18:27:09.904677865Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/materialscloud-iframe/index.html.j2", line 1, in top-level template code                                                              
2022-06-29T18:27:09.904680708Z app[web.1]:     {%- extends 'nbconvert/templates/materialscloud-iframe/index.html.j2' -%}                                        
2022-06-29T18:27:09.904684511Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/materialscloud-iframe/index.html.j2", line 7, in top-level template code
2022-06-29T18:27:09.904687556Z app[web.1]:     {% set nb_title = nb.metadata.get('title', '') or resources['metadata']['name'] %}                                 
2022-06-29T18:27:09.904690436Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/lab/index.html.j2", line 4, in top-level template code                             
2022-06-29T18:27:09.904693266Z app[web.1]:     {% from 'voila_setup.macro.html.j2' import voila_setup_helper_functions, voila_setup_nbextensions with context %}
2022-06-29T18:27:09.904696072Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/lab/index.html.j2", line 3, in top-level template code                                                                            
2022-06-29T18:27:09.904703048Z app[web.1]:     {% from 'jupyter_widgets.html.j2' import jupyter_widgets %}                                                                                                                                 
2022-06-29T18:27:09.904706085Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/lab/base.html.j2", line 260, in top-level template code
2022-06-29T18:27:09.904708955Z app[web.1]:     {% set div_id = uuid4() %}                                         
2022-06-29T18:27:09.904711640Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/display_priority.j2", line 1, in top-level template code                                                                     
2022-06-29T18:27:09.904714536Z app[web.1]:     {%- extends 'base/null.j2' -%}                                                                                                                                                              
2022-06-29T18:27:09.904717273Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/null.j2", line 26, in top-level template code
2022-06-29T18:27:09.904720094Z app[web.1]:     {%- block body -%}                                                 
2022-06-29T18:27:09.904722741Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/base/null.j2", line 29, in block 'body'
2022-06-29T18:27:09.904725536Z app[web.1]:     {%- block body_loop -%}
2022-06-29T18:27:09.904728207Z app[web.1]:   File "/app/.heroku/python/share/jupyter/nbconvert/templates/materialscloud-iframe/index.html.j2", line 395, in block 'body_loop'
2022-06-29T18:27:09.904731351Z app[web.1]:     {{ super() }}
2022-06-29T18:27:09.904733956Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/async_utils.py", line 65, in auto_await
2022-06-29T18:27:09.904736768Z app[web.1]:     return await t.cast("t.Awaitable[V]", value)
2022-06-29T18:27:09.904739617Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/runtime.py", line 366, in <listcomp>
2022-06-29T18:27:09.904743322Z app[web.1]:     [x async for x in self._stack[self._depth](self._context)]  # type: ignore
2022-06-29T18:27:09.904746087Z app[web.1]:   File "/app/.heroku/python/share/jupyter/voila/templates/lab/index.html.j2", line 47, in block 'body_loop'
2022-06-29T18:27:09.904748924Z app[web.1]:     {%- block body_loop -%}
2022-06-29T18:27:09.904751556Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jinja2/async_utils.py", line 65, in auto_await
2022-06-29T18:27:09.904754659Z app[web.1]:     return await t.cast("t.Awaitable[V]", value)
2022-06-29T18:27:09.904757422Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 152, in inner_kernel_start
2022-06-29T18:27:09.904760286Z app[web.1]:     return await self._jinja_kernel_start(nb, kernel_id, kernel_future)
2022-06-29T18:27:09.904763547Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/voila/notebook_renderer.py", line 213, in _jinja_kernel_start
2022-06-29T18:27:09.904766514Z app[web.1]:     self.executor.kc = await self.executor.async_start_new_kernel_client() 
2022-06-29T18:27:09.904769232Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/nbclient/client.py", line 532, in async_start_new_kernel_client
2022-06-29T18:27:09.904772073Z app[web.1]:     await ensure_async(self.kc.wait_for_ready(timeout=self.startup_timeout))
2022-06-29T18:27:09.904774803Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/nbclient/util.py", line 96, in ensure_async
2022-06-29T18:27:09.904784041Z app[web.1]:   File "/app/.heroku/python/lib/python3.8/site-packages/jupyter_client/client.py", line 205, in _async_wait_for_ready
2022-06-29T18:27:09.904786972Z app[web.1]:     raise RuntimeError("Kernel didn't respond in %d seconds" % timeout)
2022-06-29T18:27:09.904789892Z app[web.1]: RuntimeError: Kernel didn't respond in 60 seconds
2022-06-29T18:27:09.904792584Z app[web.1]: ERROR:tornado.general:Cannot send error response after headers written
giovannipizzi commented 2 years ago

Is this reproducible? If so, to understand if the issue is with the template or not, I suggest to deploy another app (e.g. on the dev machine) with a standard template. If it's the template, check what changed (maybe newer versions of voila require changes to the template?). Or if it's not the template, change the code simplifying it to see when it stops timing out?

unkcpz commented 2 years ago

@giovannipizzi thanks a lot for the suggestions.

Is this reproducible?

I won't say it is fully reproducible. It appears more frequently when I use less pre-heat kernel. On the dev server, since the load is not big, it rarely happened. If there is a tool to intentionally try to request the service and force it to open more kernels, we can give it a try. But I still do not fully understand how voila allocate kernels. It is based on IP or every time a user opens a browser session it will open a kernel? @dou-du Do you know how it works under the hood?

I think it will be also useful if we have some tool that can conduct a stress test for any dokku app. @eimrek Is that possible to have this as what the monitor server is doing now?

eimrek commented 2 years ago

Hi, could this be related to https://github.com/gliderlabs/herokuish/issues/659?

And also this fix might be related:

https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml

Edit: regarding the stress test, yes, in principle we could use something similar as we use in the monitor server to access the app many times: https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_monitor_server/files/check_httpjs.py

unkcpz commented 2 years ago

@eimrek thanks a lot! I think they are related. The change in fix-dockerfile.yml made by @ltalirz was to change the permission of /dev/random to 640 to make it inaccessible. However, neither the /dev/random on host or on the optimadeclient container have 444. I am not pretty sure what is this gliderlabs/herokuish image used for but there is a new one with the tag latest which I go into image and the /dev/random is 444.

Maybe @dou-du ever build the image by hand? I remember you mentioned you have updated the version? Can you confirm with it?

I think to fix this, I need first figure out what is this gliderlabs/herokuish used for and check if the /dev/random is modified to workaround the issue. If that is the case, we can tear down all the app and delete the gliderlabs/herokuish images and redeploy the server by running the whole ansible role. Any idea and comment on this? @giovannipizzi @eimrek

unkcpz commented 2 years ago

We can now use python 3.8 for optimade-client which means dokku or herokuish (I need to spend some time to understand how these things work under the hood, sorry for the vagueness of the terminologies here.) is updated as mentioned at https://github.com/materialscloud-org/tools-optimade-client/pull/42

dou-du commented 2 years ago

@eimrek thanks a lot! I think they are related. The change in fix-dockerfile.yml made by @ltalirz was to change the permission of /dev/random to 640 to make it inaccessible. However, neither the /dev/random on host or on the optimadeclient container have 444. I am not pretty sure what is this gliderlabs/herokuish image used for but there is a new one with the tag latest which I go into image and the /dev/random is 444.

Maybe @dou-du ever build the image by hand? I remember you mentioned you have updated the version? Can you confirm with it?

I think to fix this, I need first figure out what is this gliderlabs/herokuish used for and check if the /dev/random is modified to workaround the issue. If that is the case, we can tear down all the app and delete the gliderlabs/herokuish images and redeploy the server by running the whole ansible role. Any idea and comment on this? @giovannipizzi @eimrek

I updated the version number from ansible.

ltalirz commented 2 years ago

Indeed, just start from the latest version of the image and apply the fix to prevent the "stuck due to lack of entropy" issue.

unkcpz commented 2 years ago

Thanks @ltalirz. Could you elaborate a bit on how the image gliderlabs/herokuish:latest is used in the dokku server? I using docker run -it gliderlabs/herokuish:latest to go into the container and find the /start is the modified one with the change you applied. But the /dev/random permission is not changed. Is that because /start script as entrypoint is not executed or is that the /dev/random is not able to apply changes for permission?

ltalirz commented 2 years ago

If you just run the image with the entry point bash, then the permission won't be changed.

herokuish is built on top of the heroku docker image and basically does what they do (detect what kind of app the user is trying to run, install dependencies, etc.)

unkcpz commented 2 years ago

If you just run the image with the entry point bash, then the permission won't be changed.

Then how do I test if this change is taking effect? I just can not find where the image built by ansible task https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml is used in the server.

ltalirz commented 2 years ago

Then how do I test if this change is taking effect?

Well, if you exec into a container started by dokku, you will see.

I just can not find where the image built by ansible task https://github.com/materialscloud-org/openstack-ansible/blob/master/roles/mc_dokku/tasks/fix-dockerfile.yml is used in the server.

Well, dokku should be using this image (you may need to look at the dokku source code).

In any case, you can simply look at the containers running in one of the dokku instances and see whether they are using the image

unkcpz commented 1 year ago

@ltalirz thanks and sorry for the late reply, I was on vacation last week.

Well, if you exec into a container started by dokku, you will see.

I did this and it is for sure not using have the /dev/random set to correct permission.

In any case, you can simply look at the containers running in one of the dokku instances and see whether they are using the image

I use docker inspect to trace the image of the optimadeclient and then inspect the image. What I found are 1) the sha256 id of Parent image can not be found in the list of docker ps and the herokuish's dockerhub https://hub.docker.com/r/gliderlabs/herokuish/tags?page=1. 2) There are maybe related useful label information to debug, I attached below.

            "Env": [                                                                                                                                                                                                                       
                "CACHE_PATH=/cache",                                                                                                                                                                                                       
                "USER=herokuishuser",                                                                                                                                                                                                      
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",                                                                                                                                                       
                "STACK=heroku-20",                                                                                                                                                                                                         
                "DEBIAN_FRONTEND=noninteractive"                                                                                                                                                                                           
            ],                                                                                                                                                                                                                             
            "Cmd": [                                                                                                                                                                                                                       
                "/bin/sh",                                                                                                                                                                                                                 
                "-c",                                                                                                                                                                                                                      
                "#(nop) ",                                                                                                                                                                                                                 
                "LABEL org.label-schema.vendor=dokku"                                                                                                                                                                                      
            ],                                                                                                                                                                                                                             
            "Image": "sha256:832c50abf474930c7cee20461729e54fc30280bf55f9b0f3a40136637a2bbfe1",                                                                                                                                            
            "Volumes": null,                                                                                                                                                                                                               
            "WorkingDir": "",                                                                                                                                                                                                              
            "Entrypoint": null,                                                                                                                                                                                                            
            "OnBuild": null,                                                                                                                                                                                                               
            "Labels": {                                                                                                                                                                                                                    
                "com.dokku.app-name": "optimadeclient",                                                                                                                                                                                    
                "com.dokku.image-stage": "release",                                                                                                                                                                                        
                "com.gliderlabs.herokuish/stack": "heroku-20",                                                                                                                                                                             
                "dokku": "",                                                                                                                                                                                                               
                "org.label-schema.schema-version": "1.0",                                                                                                                                                                                  
                "org.label-schema.vendor": "dokku"                                                                                                                                                                                         
            }     

I suspect the new version of dokku use the new tag point to the image, I am curious what this heroku-20 means. Do you have any idea what to look further?