shrekris-anyscale closed this issue 1 year ago.
We had a hypothesis that the user was running into #539, which was fixed by #540. The user upgraded KubeRay to master (which contains #540) by replacing the cluster-scoped resources, but they could still reproduce the error:
kubectl replace -k "github.com/ray-project/kuberay/manifests/cluster-scope-resources?ref=${KUBERAY_VERSION}&timeout=90s"
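For reference, one way to double-check that the upgrade actually took effect (a sketch; the kuberay-operator deployment name and ray-system namespace are assumptions about the install, not something confirmed in this thread):

```shell
# Check which operator image is actually running (names are assumptions; adjust to your install).
kubectl get deployment kuberay-operator -n ray-system \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Confirm the Ray CRDs are present after the replace.
kubectl get crd rayclusters.ray.io rayservices.ray.io rayjobs.ray.io
```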
Then, they created a new namespace-scoped operator using the following config, which uses the image kuberay/operator:nightly:
apiVersion: v1
kind: Namespace
metadata:
  name: ray-serve-prototype-v2
  labels:
    app: ray
    owner: alexandre.gariepy
---
#############################################################################################################
# Operator-related
#############################################################################################################
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    app: ray
    owner: alexandre.gariepy
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: null
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    owner: alexandre.gariepy
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
rules:
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - create
  - get
  - list
  - update
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - pods/status
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - serviceaccounts
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ""
  resources:
  - services/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingressclasses
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayclusters
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayclusters/finalizer
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayclusters/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - ray.io
  resources:
  - rayjobs
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayjobs/finalizer
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayjobs/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - ray.io
  resources:
  - rayservices
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - ray.io
  resources:
  - rayservices/finalizers
  verbs:
  - update
- apiGroups:
  - ray.io
  resources:
  - rayservices/status
  verbs:
  - get
  - patch
  - update
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - rolebindings
  verbs:
  - create
  - delete
  - get
  - list
  - watch
- apiGroups:
  - rbac.authorization.k8s.io
  resources:
  - roles
  verbs:
  - create
  - delete
  - get
  - list
  - update
  - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    owner: alexandre.gariepy
  name: kuberay-operator-leader-election
  namespace: ray-serve-prototype-v2
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - get
  - list
  - watch
  - create
  - update
  - patch
  - delete
- apiGroups:
  - ""
  resources:
  - configmaps/status
  verbs:
  - get
  - update
  - patch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    owner: alexandre.gariepy
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kuberay-operator
subjects:
- kind: ServiceAccount
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    owner: alexandre.gariepy
  name: kuberay-operator-leader-election
  namespace: ray-serve-prototype-v2
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kuberay-operator-leader-election
subjects:
- kind: ServiceAccount
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8080"
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    owner: alexandre.gariepy
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
spec:
  ports:
  - name: monitoring-port
    port: 8080
    targetPort: 8080
  selector:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: kuberay-operator
    app.kubernetes.io/name: kuberay
    app: ray
    owner: alexandre.gariepy
  name: kuberay-operator
  namespace: ray-serve-prototype-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: kuberay-operator
      app.kubernetes.io/name: kuberay
  template:
    metadata:
      labels:
        app.kubernetes.io/component: kuberay-operator
        app.kubernetes.io/name: kuberay
        owner: alexandre.gariepy
    spec:
      containers:
      - command:
        - /manager
        - -watch-namespace
        - ray-serve-prototype-v2
        image: kuberay/operator:nightly
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
        name: kuberay-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: http
          initialDelaySeconds: 10
          periodSeconds: 5
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 50Mi
        securityContext:
          allowPrivilegeEscalation: false
      securityContext:
        runAsNonRoot: true
      serviceAccountName: kuberay-operator
      terminationGracePeriodSeconds: 10
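Assuming the config above is saved as kuberay-operator.yaml (the filename is mine, not from the thread), applying and verifying it would look roughly like:

```shell
kubectl apply -f kuberay-operator.yaml

# Wait for the namespaced operator to come up, then check its logs to confirm
# it is only watching ray-serve-prototype-v2 (per the -watch-namespace flag above).
kubectl rollout status deployment/kuberay-operator -n ray-serve-prototype-v2
kubectl logs deployment/kuberay-operator -n ray-serve-prototype-v2 --tail=50
```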
Then, they deployed a RayService using the exact same config as the issue body:
#############################################################################################################
# RayService related
#############################################################################################################
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: rayservice-xgboost-model
spec:
  serviceUnhealthySecondThreshold: 300 # Config for the health check threshold for the service. Default value is 60.
  deploymentUnhealthySecondThreshold: 300 # Config for the health check threshold for deployments. Default value is 60.
  serveConfig:
    importPath: example_xgboost.model
    deployments:
      - name: FraudDetection
        numReplicas: 3
        routePrefix: "/"
  rayClusterConfig:
    rayVersion: '2.0.0' # should match the Ray version in the image of the containers
    ######################headGroupSpecs#################################
    # head group template and specs, (perhaps 'group' is not needed in the name)
    headGroupSpec:
      # Kubernetes Service Type, valid values are 'ClusterIP', 'NodePort' and 'LoadBalancer'
      serviceType: ClusterIP
      # the pod replicas in this group typed head (assuming there could be more than 1 in the future)
      replicas: 1
      # logical group name, for this called head-group, also can be functional
      # pod type head or worker
      # rayNodeType: head # Not needed since it is under the headgroup
      # the following params are used to complete the ray start: ray start --head --block --redis-port=6379 ...
      rayStartParams:
        port: '6379' # should match container port named gcs-server
        #include_webui: 'true'
        object-store-memory: '100000000'
        # webui_host: "10.1.2.60"
        dashboard-host: '0.0.0.0'
        num-cpus: '2' # can be auto-completed from the limits
        node-ip-address: $MY_POD_IP # auto-completed as the head pod IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            # custom labels. NOTE: do not define custom labels that start with `raycluster.`; they may be used by the controller.
            # Refer to https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
            rayCluster: raycluster-sample # will be injected if missing
            rayNodeType: head # will be injected if missing, must be head or worker
            groupName: headgroup # will be injected if missing
          # annotations for pod
          annotations:
            key: value
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: n1-standard-8
          containers:
          - name: ray-head
            image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
            imagePullPolicy: Always
            #image: bonsaidev.azurecr.io/bonsai/lazer-0-9-0-cpu:dev
            env:
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
              value: "0"
            resources:
              limits:
                cpu: 2
                memory: 2Gi
              requests:
                cpu: 2
                memory: 2Gi
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
    workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 3
      minReplicas: 3
      maxReplicas: 3
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # if worker pods need to be added, we can simply increment the replicas
      # if worker pods need to be removed, we decrement the replicas, and populate the podsToDelete list
      # the operator will remove pods from the list until the number of replicas is satisfied
      # when a pod is confirmed to be deleted, its name will be removed from the list below
      #scaleStrategy:
      #  workersToDelete:
      #  - raycluster-complete-worker-small-group-bdtwh
      #  - raycluster-complete-worker-small-group-hv457
      #  - raycluster-complete-worker-small-group-k8tj7
      # the following params are used to complete the ray start: ray start --block --node-ip-address= ...
      rayStartParams:
        node-ip-address: $MY_POD_IP
        block: 'true'
      #pod template
      template:
        metadata:
          labels:
            key: value
          # annotations for pod
          annotations:
            key: value
        spec:
          nodeSelector:
            node.kubernetes.io/instance-type: n1-standard-8
          initContainers:
          # the env var $RAY_IP is set by the operator if missing, with the value of the head service name
          - name: init-myservice
            image: busybox:1.28
            command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
          containers:
          - name: machine-learning # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
            image: gcr.io/shopify-docker-images/apps/app/ray-protwotype:addc9ad3b27d00ff4b64b4b9800c69d50ece7fa8
            imagePullPolicy: Always
            # environment variables to set in the container. Optional.
            # Refer to https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
            env:
            - name: RAY_DISABLE_DOCKER_CPU_WARNING
              value: "1"
            - name: TYPE
              value: "worker"
            - name: CPU_REQUEST
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: requests.cpu
            - name: CPU_LIMITS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: limits.cpu
            - name: MEMORY_LIMITS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: limits.memory
            - name: MEMORY_REQUESTS
              valueFrom:
                resourceFieldRef:
                  containerName: machine-learning
                  resource: requests.memory
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: MY_POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: SERVE_DEPLOYMENT_HANDLE_IS_SYNC
              value: "0"
            ports:
            - containerPort: 80
              name: client
            lifecycle:
              preStop:
                exec:
                  command: ["/bin/sh", "-c", "ray stop"]
            resources:
              limits:
                cpu: "2"
                memory: "2Gi"
              requests:
                cpu: "2"
                memory: "2Gi"
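For completeness, a sketch of how this would be deployed and inspected (the rayservice.yaml filename is hypothetical; the resource name comes from the manifest above):

```shell
kubectl apply -f rayservice.yaml -n ray-serve-prototype-v2

# Watch the pods the operator creates, and check the RayService's reported status and events.
kubectl get pods -n ray-serve-prototype-v2 -w
kubectl describe rayservice rayservice-xgboost-model -n ray-serve-prototype-v2
```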
Note: In this case, the issue occurred the first time the user deployed the RayService, not when they upgraded an existing RayService.
Hi @shrekris-anyscale, does this issue still persist?
@sihanwang41 is this issue resolved by your change in #1014?
yes!
Great! @kevin85421 this issue should be resolved, so I'm closing it.
Search before asking
KubeRay Component
Others
What happened + What you expected to happen
A user had this Serve app:
Python Code
Docker Image
Kubernetes Config
Here's what the user described:
The user saw many log files on the worker node:
Sample log file
The .err files only contained: task_name:run_graph.
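For anyone trying to reproduce this, a minimal way to inspect those log files from outside the pod (a sketch; the namespace and worker pod name are placeholders, and /tmp/ray/session_latest/logs is Ray's default log directory):

```shell
# List the per-worker log files Ray writes inside the pod.
kubectl exec -n ray-serve-prototype-v2 <worker-pod-name> -- \
  ls -la /tmp/ray/session_latest/logs

# Dump the .err files mentioned above.
kubectl exec -n ray-serve-prototype-v2 <worker-pod-name> -- \
  sh -c 'head -50 /tmp/ray/session_latest/logs/*.err'
```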
Reproduction script
See above for code and logs.
Anything else
No response
Are you willing to submit a PR?