palantir / k8s-spark-scheduler

A Kubernetes Scheduler Extender to provide gang scheduling support for Spark on Kubernetes
Apache License 2.0

Pods go into Pending state intermittently; scheduler restart solves the issue #251

Open hunny-garg opened 1 year ago

hunny-garg commented 1 year ago

We are facing an issue in our environment where Spark pods intermittently get stuck in the Pending state. We have to restart the Spark scheduler pods to fix it. We are seeing the errors below in the spark-scheduler-extender logs, but we are not sure whether they are related to the issue. We are looking for pointers that could explain this behaviour.

k8s version: v1.23
spark-scheduler version: v0.58.0

"stacktrace": "error when looking for already bound reservations\nfailed to get resource reservations podName:agg-spark-350zvn28en0u-b29f74875b02ba23-exec-1, podNamespace:prod01\n\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*ResourceReservationManager).FindAlreadyBoundReservationNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resourcereservations.go:141\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectExecutorNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:382\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).selectNode\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:210\ngithub.com/palantir/k8s-spark-scheduler/internal/extender.(*SparkSchedulerExtender).Predicate\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/internal/extender/resource.go:151\ngithub.com/palantir/k8s-spark-scheduler/cmd.registerExtenderEndpoints.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/cmd/endpoints.go:36\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:136\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteLogTraceSpan.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:107\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRouteRequestLog.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/route.go:32\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestMetricRequestMeter.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:168\ngithub.com/palantir/witchcraft-go-server/wrouter.(*routeRequestHandlerWithNext).HandleRequest\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:150\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).Register.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:139\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2109\ngithub.com/julienschmidt/httprouter.(*Router).Handler.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:275\ngithub.com/julienschmidt/httprouter.(*Router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/julienschmidt/httprouter/router.go:387\ngithub.com/palantir/witchcraft-go-server/wrouter/whttprouter.(*router).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/whttprouter/routerimpl.go:71\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestExtractIDs.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:139\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextLoggers.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:73\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestContextMetricsRegistry.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:84\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1.1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:42\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/negroni.(*Recovery).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/negroni/recovery.go:193\ngithub.com/palantir/witchcraft-go-server/witchcraft/internal/middleware.NewRequestPanicRecovery.func1\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/witchcraft/internal/middleware/request.go:41\ngithub.com/palantir/witchcraft-go-server/wrouter.(*requestHandlerWithNext).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:250\ngithub.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).ServeHTTP\n\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/github.com/palantir/witchcraft-go-server/wrouter/router_root.go:103\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2947\nnet/http.initALPNRequest.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:3556\nnet/http.(*http2serverConn).runHandler\n\t/usr/local/go/src/net/http/h2_bundle.go:5910",
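
Most of the frames in the `stacktrace` field above are HTTP middleware, so it can help to decode the escaped string and keep only the extender's own frames. The following is a minimal Python sketch of that, not part of the scheduler itself. The helper names `split_trace` and `own_frames` are hypothetical, the sample string is a shortened excerpt of the trace (two of its frames), and the filter assumes that vendored and Go standard library frames can be recognised by `/vendor/` and `/usr/local/go/` in their file paths, as they are in this trace.

```python
import json

# Shortened excerpt of the "stacktrace" value, still in its JSON-escaped form.
RAW = (
    r'"error when looking for already bound reservations\n'
    r'failed to get resource reservations '
    r'podName:agg-spark-350zvn28en0u-b29f74875b02ba23-exec-1, podNamespace:prod01\n\n'
    r'github.com/palantir/k8s-spark-scheduler/internal/extender.'
    r'(*ResourceReservationManager).FindAlreadyBoundReservationNode\n'
    r'\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/'
    r'internal/extender/resourcereservations.go:141\n'
    r'github.com/palantir/witchcraft-go-server/wrouter.(*rootRouter).ServeHTTP\n'
    r'\t/home/circleci/go/src/github.com/palantir/k8s-spark-scheduler/vendor/'
    r'github.com/palantir/witchcraft-go-server/wrouter/router_root.go:103"'
)


def split_trace(escaped):
    """Decode the JSON string and return (messages, frames).

    Each frame is a (function, file:line) pair, matching the two-line
    layout the trace uses for every entry after the blank line.
    """
    text = json.loads(escaped)  # turns \n and \t into real characters
    header, _, body = text.partition("\n\n")
    lines = body.split("\n")
    frames = [
        (lines[i], lines[i + 1].strip())
        for i in range(0, len(lines) - 1, 2)
    ]
    return header.split("\n"), frames


def own_frames(frames):
    """Drop vendored and standard-library frames (path-based assumption)."""
    return [
        (func, loc)
        for func, loc in frames
        if "/vendor/" not in loc and "/usr/local/go/" not in loc
    ]


if __name__ == "__main__":
    messages, frames = split_trace(RAW)
    for message in messages:
        print(message)
    for func, loc in own_frames(frames):
        print(f"  {func}\n    at {loc}")
```

Applied to the full trace, this kind of filter would leave the frames under `internal/extender` and `cmd`, with `FindAlreadyBoundReservationNode` at `resourcereservations.go:141` at the top.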

hunny-garg commented 1 year ago

We also see the errors below in the spark-scheduler-extender container logs when this issue starts occurring.

{"type":"service.1","time":"2023-04-08T02:39:45.830415574Z","level":"WARN","origin":"github.com/palantir/k8s-spark-scheduler","message":"found unexplained cache size difference","params":{"rrs":0,"rrsCached":109}}
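
To check whether this warning lines up with the times pods get stuck, the log can be scanned for it and the gap between the two counters extracted. Below is a minimal Python sketch under stated assumptions: the function name `cache_size_gap` is hypothetical, and the reading of `rrs` and `rrsCached` as two counts of resource reservations (one from the extender's cache) is inferred from the field names only, not confirmed by the log or by the project.

```python
import json

# The WARN line from the extender log, copied verbatim.
LINE = (
    '{"type":"service.1","time":"2023-04-08T02:39:45.830415574Z",'
    '"level":"WARN","origin":"github.com/palantir/k8s-spark-scheduler",'
    '"message":"found unexplained cache size difference",'
    '"params":{"rrs":0,"rrsCached":109}}'
)

WARNING = "found unexplained cache size difference"


def cache_size_gap(log_line):
    """Return rrsCached - rrs for a cache-size warning, else None."""
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(entry, dict) or entry.get("message") != WARNING:
        return None
    params = entry.get("params") or {}
    if "rrs" not in params or "rrsCached" not in params:
        return None
    return params["rrsCached"] - params["rrs"]


if __name__ == "__main__":
    print(cache_size_gap(LINE))  # prints 109
```

For the line above, the gap is 109: the log reports `rrs` as 0 while `rrsCached` is 109. Running the same function over the whole log and comparing the timestamps of non-`None` results with the times pods entered Pending would show whether the two are correlated.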