vmware-archive / runtimes

Kubeless function runtimes: https://kubeless.io/docs/runtimes/
Apache License 2.0

Fix python runtime deadlock due to multithreading + fork #97

Closed varunbpatil closed 3 years ago

varunbpatil commented 4 years ago

Drop support for the Python 2.7 runtime because it has reached end of life, and because the "spawn" and "forkserver" multiprocessing start methods, which are required to fix the deadlock, are not available in Python 2.7.

Fixes #94
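For context, the fix relies on the non-fork start methods that only exist in Python 3. A minimal sketch (not the runtime's actual code; `work` and the queue are illustrative) of selecting such a start method:

```python
import multiprocessing
import os

def work(q):
    # Under "spawn" (or "forkserver") this runs in a freshly started
    # interpreter, so it does not inherit locks that another thread in the
    # parent may be holding at fork time -- the state that causes the deadlock.
    q.put(os.getpid())

# Python 3 on Linux offers all three start methods; Python 2.7 only ever
# had plain fork(), which is why the fix requires dropping it.
methods = multiprocessing.get_all_start_methods()

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")  # "forkserver" also works on POSIX
    q = ctx.Queue()
    p = ctx.Process(target=work, args=(q,))
    p.start()
    child_pid = q.get()
    p.join()
    print(child_pid != os.getpid())  # True
```

The `if __name__ == "__main__"` guard matters here: with "spawn" and "forkserver", the child re-imports the main module, so unguarded top-level code would run again in every child.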

varunbpatil commented 4 years ago

Hi @andresmgot, I have tested this change with 1000 requests and a concurrency of 50. I did not observe any deadlock. Without this change, the same command results in a deadlock pretty easily. The function I used to test this is stable/python/examples/preload.py, which I have included in this PR.

I have also added some inline comments on the PR where I thought an explanation would be necessary.


$ hey -n 1000 -c 50 -m GET -t 0 -disable-keepalive http://localhost:8080

Summary:                                                 
  Total:        7.1151 secs
  Slowest:      2.9556 secs
  Fastest:      0.0102 secs         
  Average:      0.2326 secs                          
  Requests/sec: 140.5466                             

  Total data:   11000 bytes                          
  Size/request: 11 bytes                             

Response time histogram: 
  0.010 [1]     |     
  0.305 [939]   |________________________________________
  0.599 [4]     |   
  0.894 [0]     |   
  1.188 [21]    |_  
  1.483 [15]    |_  
  1.777 [8]     |   
  2.072 [5]     |   
  2.367 [1]     |   
  2.661 [0]     |
  2.956 [6]     |                   

Latency distribution:                                
  10% in 0.0409 secs                                 
  25% in 0.0946 secs                                 
  50% in 0.1819 secs
  75% in 0.2225 secs     
  90% in 0.2433 secs  
  95% in 1.1023 secs
  99% in 2.0323 secs

Details (average, fastest, slowest):
  DNS+dialup:   0.0582 secs, 0.0102 secs, 2.9556 secs
  DNS-lookup:   0.0007 secs, 0.0000 secs, 0.0132 secs
  req write:    0.0001 secs, 0.0000 secs, 0.0074 secs
  resp wait:    0.1724 secs, 0.0099 secs, 1.9005 secs
  resp read:    0.0019 secs, 0.0000 secs, 0.1229 secs

Status code distribution:
  [200] 1000 responses
varunbpatil commented 3 years ago

Thank you for the PR and the notes @varunbpatil! I just have a request about maintaining the req object and preserving backward compatibility.

Hi @andresmgot, sorry for the delay in fixing this. I have made changes to preserve the request object so that custom headers, query strings, forms, file uploads, etc. can be accessed in user code. The changes work, but it is starting to look like a hack that requires understanding the internals of the Bottle framework. I can't think of any other way to make the request object serializable.
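The general idea of making a request serializable can be sketched as follows. This is purely illustrative, not the PR's actual code: `RequestSnapshot` and `snapshot_request` are hypothetical names, and the real change reconstructs a Bottle request rather than a plain object.

```python
import pickle

class RequestSnapshot:
    """A plain, picklable stand-in for a live WSGI/Bottle request."""
    def __init__(self, method, path, headers, query, body):
        self.method = method
        self.path = path
        self.headers = headers  # plain dict, not a framework header proxy
        self.query = query      # raw query string; parsed later in the worker
        self.body = body        # raw bytes, already read from the socket

def snapshot_request(environ, body):
    # Copy only plain data out of the WSGI environ; file handles, sockets,
    # and framework proxy objects cannot cross a process boundary.
    headers = {k[5:].replace("_", "-").title(): v
               for k, v in environ.items() if k.startswith("HTTP_")}
    return RequestSnapshot(
        method=environ.get("REQUEST_METHOD", "GET"),
        path=environ.get("PATH_INFO", "/"),
        headers=headers,
        query=environ.get("QUERY_STRING", ""),
        body=body,
    )

# Example with a minimal fake environ:
environ = {"REQUEST_METHOD": "POST", "PATH_INFO": "/f",
           "QUERY_STRING": "a=1", "HTTP_X_CUSTOM": "yes"}
snap = snapshot_request(environ, b'{"k": "v"}')
clone = pickle.loads(pickle.dumps(snap))  # survives the process boundary
print(clone.headers)  # {'X-Custom': 'yes'}
```

The awkward part, as noted above, is that anything the user code expects on the real request object has to be rebuilt on the other side of the boundary.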

Maybe a better design would be to run multiple single-threaded servers behind a reverse proxy to handle requests in parallel (i.e., a pre-fork model). That comes with its own problems: handling timeouts, managing multiple server processes, synchronizing logging, etc.

I think this PR should be sufficient to fix this somewhat critical bug while maintaining backward compatibility. Maybe we can re-evaluate the design in the near future.

I would be glad to hear your thoughts on this.

andresmgot commented 3 years ago

It seems the get-python-deps test is failing with the new image. Can you check it, @varunbpatil?

varunbpatil commented 3 years ago

It seems the get-python-deps test is failing with the new image. Can you check it, @varunbpatil?

Will check. Seems like something to do with Prometheus.

do you mind adding yourself to the OWNERS file?

Will do.

varunbpatil commented 3 years ago

Hi @andresmgot, I have fixed the issue. The Prometheus metrics variables needed to be moved inside if __name__ == '__main__' because they can only be created once. I missed this earlier because I had commented out some of the code. I have also added myself to the OWNERS file.

I have force-pushed my changes, so your changes to the image sha256 will have to be done again since the image will be different. Sorry for the trouble.

varunbpatil commented 3 years ago

Hi @andresmgot, I am not sure what the current failure is. I tried the new Docker images on a local Kubernetes cluster and they work for me.

[~/runtimes/stable/python] $ make get-python-deps
cd examples && zip hellowithdeps.zip hellowithdeps.py hellowithdepshelper.py
/home/varun.patil/runtimes/stable/python/examples
updating: hellowithdeps.py (stored 0%)
updating: hellowithdepshelper.py (deflated 29%)
kubeless function deploy get-python-deps --runtime python3.6 --handler hellowithdeps.foo --from-file examples/hellowithdeps.zip --dependencies examples/requirements.txt --image-pull-policy IfNotPresent
INFO[0000] Deploying function...
INFO[0000] Function get-python-deps submitted for deployment
INFO[0000] Check the deployment status executing 'kubeless function ls get-python-deps'

[~/runtimes/stable/python] $ make get-python-deps-verify
kubectl rollout status deployment/get-python-deps && sleep 2
deployment "get-python-deps" successfully rolled out
kubeless function call get-python-deps |egrep Google
Google

Is there a way to get the deployment status?

andresmgot commented 3 years ago

Is there a way to get the deployment status?

Let me check. I am going to run the tests with SSH and report back the issue.

andresmgot commented 3 years ago

So it is working properly; it seems that even though the pod is running, it is not able to accept requests right away. Note that there is no error in the logs, so the request is not even being handled:

circleci@default-dca24841-1afb-473b-90f7-86c13f586b8c:~/project/stable/python$ make get-python-deps-verify 
kubectl rollout status deployment/get-python-deps && sleep 2
deployment "get-python-deps" successfully rolled out
kubeless function call get-python-deps |egrep Google
Google

circleci@default-dca24841-1afb-473b-90f7-86c13f586b8c:~/project/stable/python$ kubectl logs get-python-deps-6b694c6f98-2hn6d
Bottle v0.12.13 server starting up (using CherryPyServer(numthreads=10))...
Listening on http://0.0.0.0:8080/
Hit Ctrl-C to quit.

172.17.0.1 - - [28/Sep/2020:09:05:26 +0000] "GET /healthz HTTP/1.1" 200 2 "" "kube-probe/1.15" 0/237
172.17.0.1 - - [28/Sep/2020:09:05:56 +0000] "GET /healthz HTTP/1.1" 200 2 "" "kube-probe/1.15" 0/139
172.17.0.1 - - [28/Sep/2020:09:06:26 +0000] "GET /healthz HTTP/1.1" 200 2 "" "kube-probe/1.15" 0/142
172.17.0.1 - - [28/Sep/2020:09:06:30 +0000] "GET / HTTP/1.1" 200 6 "https://10.142.4.173:8443/api/v1/namespaces/default/services/get-python-deps:http-function-port/proxy" "kubeless/v0.0.0 (linux/amd64) kubernetes/$Format" 0/80126

Adding a small sleep should be enough.

By the way, pull before pushing more changes so the change in the jsonnet doesn't get lost.

varunbpatil commented 3 years ago

But there is already a 2-second sleep for every function. Do you want me to increase it for every function, or just for the first function (hopefully the others will be ready to accept requests by then)?

By the way, pull before pushing more changes so the change in the jsonnet doesn't get lost.

Will do.

I'll also add a dump of the K8s objects on failure similar to https://github.com/kubeless/kubeless/blob/25d6376c91730f0fecb4c6f318b368dc4ce4eb03/script/integration-tests#L86-L104

andresmgot commented 3 years ago

But there is already a 2-second sleep for every function. Do you want me to increase it for every function, or just for the first function (hopefully the others will be ready to accept requests by then)?

just for that test should be fine

varunbpatil commented 3 years ago

Hi @andresmgot, I increased the sleep duration to 5s for just the first test. CI is passing.

andresmgot commented 3 years ago

Great, thanks!