vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation

Improve function cold-start latency #960

Closed. huasiy closed this issue 2 months ago

huasiy commented 4 months ago

Describe the enhancement

We are using vHive as the cloud-function platform in our work. This great project provides a white-box, tunable serverless computing infrastructure for our research. However, in our experiments we ran into problems with the cold-start latency of cloud functions. We define cold-start latency as the time elapsed from when the user sends a function invocation request to when a new function instance (a k8s pod) is created and starts processing the request.

When we invoke many function instances in parallel to process a query, the query may suffer a cold-start latency of more than 10 seconds. We would much appreciate any suggestions on how to reduce it. We notice that the cold-start latency reported in vHive's ASPLOS'21 paper is much lower (< 1 second when REAP is used, as shown in Figure 9 of the paper) and is what we expect for query processing. In our experiments, however, the cold-start latency is much higher, especially when creating more than four function instances (k8s pods) on each node (k8s worker). The cumulative distribution of cold-start latency is shown below.

[Figure: cumulative distribution of measured cold-start latency]

We have tried different function images (including vHive's 'helloworld'), different memory sizes and CPU allocations per function instance, different virtualization solutions (Docker, Firecracker, Firecracker+snapshot, Firecracker+REAP), and different physical environments (CloudLab, AWS, and on-prem servers, all with SSDs and >=10G networking). None of these significantly reduced the cold-start latency. Is there anything we should take care of to get low cold-start latency (~1 second) under high invocation concurrency, as in Figure 9 of the paper?
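For concreteness, here is a minimal sketch of the measurement methodology described above. It assumes an HTTP-triggered function behind a Knative route; `funcURL` and `concurrency` are illustrative placeholders, not values from the actual setup. Each goroutine times one invocation from request send to first response, and the sorted latencies approximate the CDF in the figure:

```go
// Hedged sketch: measure per-invocation cold-start latency as defined above
// (time from sending the request until the first response arrives), for N
// concurrent invocations against a scale-from-zero function.
package main

import (
	"fmt"
	"net/http"
	"sort"
	"sync"
	"time"
)

func main() {
	const (
		funcURL     = "http://helloworld.default.example.com" // hypothetical Knative route
		concurrency = 16                                      // parallel cold invocations
	)

	latencies := make([]time.Duration, concurrency)
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := time.Now()
			resp, err := http.Get(funcURL) // triggers pod creation if no warm instance exists
			if err != nil {
				fmt.Println("invocation failed:", err)
				return
			}
			resp.Body.Close()
			latencies[i] = time.Since(start)
		}(i)
	}
	wg.Wait()

	// Sorted latencies give rough percentiles of the cold-start distribution.
	sort.Slice(latencies, func(a, b int) bool { return latencies[a] < latencies[b] })
	for i, l := range latencies {
		fmt.Printf("p%03d: %v\n", (i+1)*100/concurrency, l)
	}
}
```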

lrq619 commented 2 months ago

Hello huasiy, sorry for the late reply. The cold-start latency in the vHive paper does not include the system overhead of the serverless backend (e.g., Knative); only the local worker delay is measured in that figure. From your description, what you measured is the end-to-end delay of the first invocation of a serverless function, which also includes overhead on the control-plane side. If you want to reproduce the results, please refer to our artifact on Zenodo: https://zenodo.org/records/4545584. Thanks for your attention! Feel free to reach out to us if you have further questions.
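To illustrate the distinction drawn here, one rough way to isolate the scale-from-zero overhead (scheduling, activator, image pull, VM boot) from steady-state request serving is to compare a cold invocation against an immediately repeated warm one. This is a hedged sketch only, reusing the same hypothetical `funcURL` as above; it does not reproduce the paper's worker-local measurement, which is taken inside the worker rather than end to end:

```go
// Hedged sketch: cold - warm roughly bounds the combined control-plane and
// boot overhead of the first invocation; Figure 9 of the paper reports only
// the worker-local portion of that cold-start path.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func timeGet(url string) time.Duration {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		return -1 // sentinel for a failed invocation
	}
	resp.Body.Close()
	return time.Since(start)
}

func main() {
	const funcURL = "http://helloworld.default.example.com" // hypothetical route

	cold := timeGet(funcURL) // first request: a new pod must be created
	warm := timeGet(funcURL) // second request: served by the now-running pod
	fmt.Printf("cold=%v warm=%v ~cold-start overhead=%v\n", cold, warm, cold-warm)
}
```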