vhive-serverless / vHive

vHive: Open-source framework for serverless experimentation

Error in function deployment #945

Closed ymc101 closed 2 months ago

ymc101 commented 4 months ago

Hi, I am trying out the whole vHive setup with function deployment and invocation, using 1 master node and 1 worker node running on 2 separate VMs, and I encountered some errors when running the deployer client:

source /etc/profile && pushd ./tools/deployer && go build && popd && ./tools/deployer/deployer -funcPath ~/vhive/configs/knative_workloads

This was the output:

WARN[0602] Failed to deploy function helloworld-0, /home/vboxuser/vhive/configs/knative_workloads/helloworld.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'helloworld-0' in namespace 'default':

  2.963s The Route is still working to reflect the latest desired specification.
  5.347s Configuration "helloworld-0" is waiting for a Revision to become ready.
Error: timeout: service 'helloworld-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0602] Deployed function helloworld-0               
WARN[0602] Failed to deploy function pyaes-1, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-1' in namespace 'default':

  1.442s The Route is still working to reflect the latest desired specification.
  4.234s Configuration "pyaes-1" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-1' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0602] Deployed function pyaes-1                    
WARN[0603] Failed to deploy function pyaes-0, /home/vboxuser/vhive/configs/knative_workloads/pyaes.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'pyaes-0' in namespace 'default':

  4.206s The Route is still working to reflect the latest desired specification.
  5.117s ...
  5.621s Configuration "pyaes-0" is waiting for a Revision to become ready.
Error: timeout: service 'pyaes-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function pyaes-0                    
WARN[0603] Failed to deploy function rnn-serving-1, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-1' in namespace 'default':

  0.751s The Route is still working to reflect the latest desired specification.
  3.778s Configuration "rnn-serving-1" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-1' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function rnn-serving-1              
WARN[0603] Failed to deploy function rnn-serving-0, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-0' in namespace 'default':

  2.567s The Route is still working to reflect the latest desired specification.
  4.244s Configuration "rnn-serving-0" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-0' not ready after 600 seconds
Run 'kn --help' for usage

INFO[0603] Deployed function rnn-serving-0              
WARN[1207] Failed to deploy function rnn-serving-2, /home/vboxuser/vhive/configs/knative_workloads/rnn_serving.yaml: exit status 1
Warning: Kubernetes default value is insecure, Knative may default this to secure in a future release: spec.template.spec.containers[0].securityContext.allowPrivilegeEscalation, spec.template.spec.containers[0].securityContext.capabilities, spec.template.spec.containers[0].securityContext.runAsNonRoot, spec.template.spec.containers[0].securityContext.seccompProfile
Creating service 'rnn-serving-2' in namespace 'default':

  2.081s The Route is still working to reflect the latest desired specification.
  3.126s ...
  5.313s Configuration "rnn-serving-2" is waiting for a Revision to become ready.
Error: timeout: service 'rnn-serving-2' not ready after 600 seconds
Run 'kn --help' for usage

Additionally, below are logs from some commands I tried:

kubectl describe deployment:

Name:                   helloworld-0-00001-deployment
Namespace:              default
CreationTimestamp:      Tue, 27 Feb 2024 15:14:38 +0800
Labels:                 app=helloworld-0-00001
                        service.istio.io/canonical-name=helloworld-0
                        service.istio.io/canonical-revision=helloworld-0-00001
                        serving.knative.dev/configuration=helloworld-0
                        serving.knative.dev/configurationGeneration=1
                        serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
                        serving.knative.dev/revision=helloworld-0-00001
                        serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
                        serving.knative.dev/service=helloworld-0
                        serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations:            autoscaling.knative.dev/target: 1
                        deployment.kubernetes.io/revision: 1
                        serving.knative.dev/creator: kubernetes-admin
Selector:               serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
Replicas:               0 desired | 0 updated | 0 total | 0 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  0 max unavailable, 25% max surge
Pod Template:
  Labels:       app=helloworld-0-00001
                service.istio.io/canonical-name=helloworld-0
                service.istio.io/canonical-revision=helloworld-0-00001
                serving.knative.dev/configuration=helloworld-0
                serving.knative.dev/configurationGeneration=1
                serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
                serving.knative.dev/revision=helloworld-0-00001
                serving.knative.dev/revisionUID=933839c6-a4fd-4bcf-907b-725a455a2503
                serving.knative.dev/service=helloworld-0
                serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
  Annotations:  autoscaling.knative.dev/target: 1
                serving.knative.dev/creator: kubernetes-admin
  Containers:
   user-container:
    Image:      index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
    Port:       50051/TCP
    Host Port:  0/TCP
    Environment:
      GUEST_PORT:       50051
      GUEST_IMAGE:      ghcr.io/ease-lab/helloworld:var_workload
      PORT:             50051
      K_REVISION:       helloworld-0-00001
      K_CONFIGURATION:  helloworld-0
      K_SERVICE:        helloworld-0
    Mounts:             <none>
   queue-proxy:
    Image:       ghcr.io/vhive-serverless/queue-39be6f1d08a095bd076a71d288d295b6@sha256:41259c52c99af616fae4e7a44e40c0e90eb8f5593378a4f3de5dbf35ab1df49c
    Ports:       8022/TCP, 9090/TCP, 9091/TCP, 8013/TCP, 8112/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Requests:
      cpu:      25m
    Readiness:  http-get http://:8013/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      SERVING_NAMESPACE:                        default
      SERVING_SERVICE:                          helloworld-0
      SERVING_CONFIGURATION:                    helloworld-0
      SERVING_REVISION:                         helloworld-0-00001
      QUEUE_SERVING_PORT:                       8013
      QUEUE_SERVING_TLS_PORT:                   8112
      CONTAINER_CONCURRENCY:                    0
      REVISION_TIMEOUT_SECONDS:                 300
      REVISION_RESPONSE_START_TIMEOUT_SECONDS:  0
      REVISION_IDLE_TIMEOUT_SECONDS:            0
      SERVING_POD:                               (v1:metadata.name)
      SERVING_POD_IP:                            (v1:status.podIP)
      SERVING_LOGGING_CONFIG:                   
      SERVING_LOGGING_LEVEL:                    
      SERVING_REQUEST_LOG_TEMPLATE:             {"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header "X-B3-Traceid"}}"}
      SERVING_ENABLE_REQUEST_LOG:               false
      SERVING_REQUEST_METRICS_BACKEND:          prometheus
      TRACING_CONFIG_BACKEND:                   none
      TRACING_CONFIG_ZIPKIN_ENDPOINT:           
      TRACING_CONFIG_DEBUG:                     false
      TRACING_CONFIG_SAMPLE_RATE:               0.1
      USER_PORT:                                50051
      SYSTEM_NAMESPACE:                         knative-serving
      METRICS_DOMAIN:                           knative.dev/internal/serving
      SERVING_READINESS_PROBE:                  {"tcpSocket":{"port":50051,"host":"127.0.0.1"},"successThreshold":1}
      ENABLE_PROFILING:                         false
      SERVING_ENABLE_PROBE_REQUEST_LOG:         false
      METRICS_COLLECTOR_ADDRESS:                
      CONCURRENCY_STATE_ENDPOINT:               
      CONCURRENCY_STATE_TOKEN_PATH:             /var/run/secrets/tokens/state-token
      HOST_IP:                                   (v1:status.hostIP)
      ENABLE_HTTP2_AUTO_DETECTION:              false
      ROOT_CA:                                  
    Mounts:                                     <none>
  Volumes:                                      <none>
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   helloworld-0-00001-deployment-85b6cd4698 (0/0 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  55m   deployment-controller  Scaled up replica set helloworld-0-00001-deployment-85b6cd4698 to 1
  Normal  ScalingReplicaSet  45m   deployment-controller  Scaled down replica set helloworld-0-00001-deployment-85b6cd4698 to 0 from 1

kubectl get revisions and kubectl describe revision <name>:

Name:         helloworld-0-00001
Namespace:    default
Labels:       serving.knative.dev/configuration=helloworld-0
              serving.knative.dev/configurationGeneration=1
              serving.knative.dev/configurationUID=36b65317-e523-4ec3-8ea6-8734ebdf4d7b
              serving.knative.dev/routingState=active
              serving.knative.dev/service=helloworld-0
              serving.knative.dev/serviceUID=c8e131fc-8a06-46a1-8895-b7fd8d9ada06
Annotations:  autoscaling.knative.dev/target: 1
              serving.knative.dev/creator: kubernetes-admin
              serving.knative.dev/routes: helloworld-0
              serving.knative.dev/routingStateModified: 2024-02-27T07:14:33Z
API Version:  serving.knative.dev/v1
Kind:         Revision
Metadata:
  Creation Timestamp:  2024-02-27T07:14:33Z
  Generation:          1
  Managed Fields:
    API Version:  serving.knative.dev/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:autoscaling.knative.dev/target:
          f:serving.knative.dev/creator:
          f:serving.knative.dev/routes:
          f:serving.knative.dev/routingStateModified:
        f:labels:
          .:
          f:serving.knative.dev/configuration:
          f:serving.knative.dev/configurationGeneration:
          f:serving.knative.dev/configurationUID:
          f:serving.knative.dev/routingState:
          f:serving.knative.dev/service:
          f:serving.knative.dev/serviceUID:
        f:ownerReferences:
          .:
          k:{"uid":"36b65317-e523-4ec3-8ea6-8734ebdf4d7b"}:
      f:spec:
        .:
        f:containerConcurrency:
        f:containers:
        f:enableServiceLinks:
        f:timeoutSeconds:
    Manager:      controller
    Operation:    Update
    Time:         2024-02-27T07:14:33Z
    API Version:  serving.knative.dev/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:actualReplicas:
        f:conditions:
        f:containerStatuses:
        f:observedGeneration:
    Manager:      controller
    Operation:    Update
    Subresource:  status
    Time:         2024-02-27T07:25:29Z
  Owner References:
    API Version:           serving.knative.dev/v1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Configuration
    Name:                  helloworld-0
    UID:                   36b65317-e523-4ec3-8ea6-8734ebdf4d7b
  Resource Version:        24730
  UID:                     933839c6-a4fd-4bcf-907b-725a455a2503
Spec:
  Container Concurrency:  0
  Containers:
    Env:
      Name:   GUEST_PORT
      Value:  50051
      Name:   GUEST_IMAGE
      Value:  ghcr.io/ease-lab/helloworld:var_workload
    Image:    crccheck/hello-world:latest
    Name:     user-container
    Ports:
      Container Port:  50051
      Name:            h2c
      Protocol:        TCP
    Readiness Probe:
      Success Threshold:  1
      Tcp Socket:
        Port:  0
    Resources:
  Enable Service Links:  false
  Timeout Seconds:       300
Status:
  Actual Replicas:  0
  Conditions:
    Last Transition Time:  2024-02-27T07:25:29Z
    Message:               The target is not receiving traffic.
    Reason:                NoTraffic
    Severity:              Info
    Status:                False
    Type:                  Active
    Last Transition Time:  2024-02-27T07:14:40Z
    Reason:                Deploying
    Status:                Unknown
    Type:                  ContainerHealthy
    Last Transition Time:  2024-02-27T07:24:50Z
    Message:               Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
    Reason:                CreateContainerError
    Status:                False
    Type:                  Ready
    Last Transition Time:  2024-02-27T07:24:50Z
    Message:               Failed to get/pull image: failed to prepare extraction snapshot "extract-305755493-hrFD sha256:5216338b40a7b96416b8b9858974bbe4acc3096ee60acbc4dfb1ee02aecceb10": context deadline exceeded
    Reason:                CreateContainerError
    Status:                False
    Type:                  ResourcesAvailable
  Container Statuses:
    Image Digest:       index.docker.io/crccheck/hello-world@sha256:0404ca69b522f8629d7d4e9034a7afe0300b713354e8bf12ec9657581cf59400
    Name:               user-container
  Observed Generation:  1
Events:                 <none>
leokondrashov commented 4 months ago

Please attach Firecracker logs from /tmp/vhive-logs/
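For example, something like this on the worker node should bundle them for attaching here (a minimal sketch; it assumes the default /tmp/vhive-logs/ location and archives everything in it):

# collect all vHive/Firecracker logs from the default log directory
tar czf vhive-logs.tar.gz -C /tmp vhive-logs/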

ymc101 commented 4 months ago

From the worker node: firecracker.stderr.zip, firecracker.stdout.zip

ymc101 commented 4 months ago

Hi @leokondrashov, I just tried running the setup, deployment, and invocation steps on CloudLab, and it proceeded smoothly without incident; however, the output file rps0.00_lat.csv is empty. Is this expected behaviour? Am I missing certain steps or configurations needed to get the latencies of the function runs?

For reference below is the endpoints.json file from the master node:

[
    {
        "hostname": "helloworld-0.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    },
    {
        "hostname": "pyaes-0.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    },
    {
        "hostname": "pyaes-1.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    },
    {
        "hostname": "rnn-serving-0.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    },
    {
        "hostname": "rnn-serving-1.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    },
    {
        "hostname": "rnn-serving-2.default.192.168.1.240.sslip.io",
        "eventing": false,
        "matchers": null
    }
]
leokondrashov commented 4 months ago

Sorry for the late follow-up. The 0.00 in the name of the file means that all invocations failed; the output of the invoker run should say how many requests succeeded. That might be caused by cold starts taking some time, so you can rerun the invoker several times to let it warm the instances up. If the requests still fail, please provide the output of kubectl get pods; the pods should be in the Running state. If they are not, please run kubectl describe pod <pod_name> on them.
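For example (a sketch of the checks described above; <pod_name> is a placeholder):

# list pods and their states; function pods should be Running
kubectl get pods

# inspect the events of any pod that is stuck (e.g. Pending or CrashLoopBackOff)
kubectl describe pod <pod_name>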

ymc101 commented 4 months ago

I ran it 2 more times after getting 0.00, and these were the contents of the csv:

8651
8922
17281
64952
41603

Is this expected behaviour? If yes, can I clarify what each number corresponds to, and whether the numbers are in milliseconds?

leokondrashov commented 4 months ago

They are the end-to-end delay measurements for each of the requests, in microseconds.
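If it helps, a quick way to view them in milliseconds (a sketch assuming one latency value per line and a file named rps0.20_lat.csv; substitute your actual file name):

# convert microsecond latencies to milliseconds, one value per line
awk '{ printf "%.3f\n", $1 / 1000 }' rps0.20_lat.csv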

ymc101 commented 4 months ago

Does each number correspond to one function in endpoints.json? Or does it represent the time it takes for all functions to finish executing (concurrently or sequentially)?

Incidentally, I tried running the invoker once on the worker node and I got 1 value in the rps_0.20.csv file, so I'm a bit confused about the representation of the output.

leokondrashov commented 4 months ago

It reports end-to-end latencies for requests to functions from endpoints.json in round-robin fashion, one number per request. By default, it sends 1 request per second for 5 seconds (both can be changed: https://github.com/vhive-serverless/vSwarm/blob/main/tools/invoker/README.md).

The number in the file's name is the rate of successful invocations per second (0.2 means 1 successful request in 5 seconds). Cold starts can cause a low success rate, since any request that responds after the 5 seconds is considered failed. That's why rerunning the invoker helps to get more representative results.
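For example, a longer run gives cold starts time to finish (flag names as documented in that README; please double-check them against your invoker version):

# run for 60 seconds at 1 request per second instead of the 5-second default
./invoker -rps 1 -time 60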

ymc101 commented 4 months ago

I see. Can I also check how to run some of the benchmark functions in vSwarm? I am trying out the fibonacci one, but I got a make: docker: Command not found error when running make all-images. I tried installing Docker with sudo apt-get install docker but am still getting the same error. Do you have an idea what the issue might be?

ymc101 commented 4 months ago

Hi @leokondrashov, do you have any input on this?

leokondrashov commented 4 months ago

I'm sorry, I thought I had sent the comment. It's better to ask questions about benchmarks in the vSwarm repository. But regarding the Docker issue: Docker is installed with apt install docker.io.

ymc101 commented 4 months ago

Alright, I'll start a new issue in the vSwarm repository if I have further questions about the benchmarks. But for the Docker issue, I tried installing docker.io, but apt is unable to find the package:

E: Unable to locate package docker.io
E: Couldn't find any package by glob 'docker.io'
E: Couldn't find any package by regex 'docker.io'

Am I supposed to run sudo apt-get upgrade first?

leokondrashov commented 4 months ago

Yes, it is better to run sudo apt update first. Please also provide the distro info, because it is weird that Ubuntu can't find the docker.io package.

Otherwise, proceed with the official guide on Docker installation: https://docs.docker.com/engine/install/ubuntu/.

ymc101 commented 4 months ago

Nothing gets updated when I run the upgrade, and it still cannot find the docker.io package. I am running Ubuntu 20.04 LTS (GNU/Linux 5.4.0-164-generic x86_64), based on the CloudLab profile provided in the vHive quickstart guide. Was the Docker installation working fine when you tested on this setup previously?

Edit: it can find the package now after running update instead.

leokondrashov commented 4 months ago

I just tried it on a fresh xl170 CloudLab node, and it works. I ran two commands: sudo apt update; sudo apt install docker.io. It successfully installs and runs. Also, you might need to add your user to the docker group: https://docs.docker.com/engine/install/linux-postinstall/, but that doesn't affect the availability of the package.
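For reference, the post-install step from that guide looks like this (commands from the linked Docker docs; you need to log out and back in, or use newgrp, for the group change to take effect):

# add the current user to the docker group so docker can run without sudo
sudo usermod -aG docker $USER
newgrp docker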

ymc101 commented 4 months ago

Thanks, the Docker make works fine now. But it now seems like there is an issue with deploying the benchmark; I'll start a new issue in the vSwarm repository for that.