Closed: Shikhar-Nahar closed this issue 2 months ago
Hi @Shikhar-Nahar, please check that all pods are in the "Running" state using the kubectl get pods command. Also check the diworker service logs for the last 24 hours to make sure no migrations are currently running. If no migrations are running and some pods are not in the Running state, try deleting those pods and wait for them to restart. If this doesn't resolve the issue, run runkube.py with the "-d" flag to delete the cluster. After the pods are deleted, run runkube.py as usual to start the cluster again using the 2024080501-public version.
If the problem persists, please provide the output of the kubectl get pods command.
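For reference, a minimal sketch of the commands described above (the runkube.py arguments are illustrative; use the deployment name and overlay file from your own optscale-deploy setup):

# check pod status and look for migration messages in the recent diworker logs
kubectl get pods
kubectl logs deployment/diworker --since=24h | grep -i migration

# delete a stuck pod; its deployment will recreate it
kubectl delete pod <pod-name>

# full restart: delete the cluster, then deploy it again with the target version
# (argument layout is illustrative; use the same arguments you normally pass to runkube.py)
./runkube.py -d -o overlay/<your-overlay>.yml -- <deployment-name> 2024080501-public
./runkube.py -o overlay/<your-overlay>.yml -- <deployment-name> 2024080501-public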
@stanfra It seems the bulldozerworker pod is stuck in its init containers:
bulldozerworker-6db7f5fbb-mxbbp 0/1 Init:3/4 0 18h
All the other pods appear to be running.
(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
arcee 0/1 1 0 18h
auth 1/1 1 1 18h
bi-exporter 1/1 1 1 18h
booking-observer-worker 2/2 2 2 18h
bulldozer-api 1/1 1 1 18h
bulldozerworker 0/1 1 0 18h
bumischeduler 1/1 1 1 18h
bumiworker 1/1 1 1 18h
calendar-observer-worker 1/1 1 1 18h
diproxy 1/1 1 1 18h
diworker 1/1 1 1 18h
error-pages 1/1 1 1 18h
etcd-operator-etcd-operator-etcd-backup-operator 0/0 0 0 191d
etcd-operator-etcd-operator-etcd-operator 1/1 1 1 191d
etcd-operator-etcd-operator-etcd-restore-operator 0/0 0 0 191d
gemini-worker 1/1 1 1 18h
grafana 1/1 1 1 18h
herald-executor 1/1 1 1 18h
heraldapi 1/1 1 1 18h
heraldengine 2/2 2 2 18h
insider-api 1/1 1 1 18h
insider-worker 1/1 1 1 18h
jira-bus 1/1 1 1 18h
jira-ui 1/1 1 1 18h
kataraapi 1/1 1 1 18h
katarascheduler 1/1 1 1 18h
kataraworker 1/1 1 1 18h
keeper 1/1 1 1 18h
keeper-executor 1/1 1 1 18h
live-demo-generator-worker 1/1 1 1 18h
metroculusapi 1/1 1 1 18h
metroculusworker 1/1 1 1 18h
myadmin 1/1 1 1 18h
ngingress-nginx-ingress-default-backend 1/1 1 1 191d
ngui 1/1 1 1 18h
ohsu 1/1 1 1 18h
organization-violations-worker 1/1 1 1 18h
pharos-receiver 1/1 1 1 18h
pharos-worker 1/1 1 1 18h
power-schedule-worker 1/1 1 1 18h
resource-discovery-worker 2/2 2 2 18h
resource-observer-worker 1/1 1 1 18h
resource-violations-worker 1/1 1 1 18h
restapi 1/1 1 1 18h
risp-worker 1/1 1 1 18h
slacker 1/1 1 1 18h
slacker-executor 1/1 1 1 18h
thanos-query 1/1 1 1 18h
trapper-worker 1/1 1 1 18h
webhook-executor 1/1 1 1 18h
The diworker logs for the last 24 hours have no other messages.
bulldozer logs:
I will try to run runkube.py with -d and restart it with the 2024080501-public version.
The arcee pod is running but is marked as not ready:
(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl get pods
NAME READY STATUS RESTARTS AGE
arcee-5c8948845b-rn7cs 0/1 Running 0 18h
auth-5db4b8c9c4-d8xf8 1/1 Running 0 18h
bi-exporter-65667f8f79-q4h4r 1/1 Running 0 18h
bi-scheduler-1723187700-wpw77 0/1 Completed 0 2m51s
booking-observer-scheduler-1723187820-tw5c6 0/1 Completed 0 58s
booking-observer-worker-6457d86766-26wsh 1/1 Running 0 18h
booking-observer-worker-6457d86766-hmvlf 1/1 Running 0 18h
bulldozer-api-6bb7f45454-wtzhc 1/1 Running 0 18h
bulldozerworker-6db7f5fbb-mxbbp 0/1 Init:3/4 0 18h
bumischeduler-8678c867cf-r9l79 1/1 Running 0 18h
bumiworker-6c66495694-2s7db 1/1 Running 0 18h
.........
Arcee logs:
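For pods in this state, a short sketch of the usual inspection commands (the init container name is a placeholder; kubectl describe lists the real ones):

# see which init container the bulldozerworker pod is stuck on, and why
kubectl describe pod bulldozerworker-6db7f5fbb-mxbbp
kubectl logs bulldozerworker-6db7f5fbb-mxbbp -c <init-container-name>

# arcee is Running but not ready: check its readiness probe events and recent logs
kubectl describe pod arcee-5c8948845b-rn7cs
kubectl logs arcee-5c8948845b-rn7cs --tail=100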
Please perform the following steps:
1) In etcd, check whether there are locks in any of the _locks folders. To connect to etcd you can use the "kubectl exec -ti etcd-0 sh" command.
2) If locks exist, delete them and check in the logs that the corresponding service starts its migration (for example, if you delete a lock in /_locks/arcee_migrations, check the arcee logs).
3) If a lock has been deleted and the service still does not start the migration, delete that service's pod and check that the migration starts.
If the migration does not start, please provide the service logs.
Keep in mind that once a migration starts, new locks will be created; please do not delete those.
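A minimal sketch of steps 2 and 3 for the arcee service (the lock key name is a placeholder; use the keys you actually see under /_locks):

# connect to the etcd pod
kubectl exec -ti etcd-0 sh

# inside the etcd pod: list and remove a stale lock key (placeholder name)
etcdctl ls /_locks/arcee_migrations
etcdctl rm /_locks/arcee_migrations/<lock-key>

# back on the host: confirm the migration starts; if it does not, recreate the pod
kubectl logs deployment/arcee -f
kubectl delete pod <arcee-pod-name>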
@stanfra I could not find a _locks directory in the etcd-0 pod:
(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl exec -ti etcd-0 sh
/ # cd /
/ # find . -iname _locks -type d
/ #
Does this mean there are no locks? I am also sharing the service log for arcee after deleting the cluster and starting it again.
@Shikhar-Nahar, the locks are stored as etcd keys rather than as files on the filesystem, so find will not show them. After connecting to etcd, please execute the command etcdctl ls _locks; you will see output like this:
https://github.com/user-attachments/assets/baa4604c-fab9-414a-871b-9dbb917b83a8
Then run the command etcdctl ls _locks/<service_name>_migrations for each service and check the outputs. If the output is empty, there are no locks for that service. If the output is not empty, delete the lock using the command etcdctl rm _locks/<service_name>_migrations/<lock_name>.
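A quick way to check every service in one pass, run from inside the etcd pod (the service list simply mirrors the folders shown under _locks below):

# inside etcd-0: print any remaining lock keys per service
for svc in arcee bulldozer gemini metroculus restapi risp diworker insider jira_bus slacker; do
  echo "== ${svc}_migrations =="
  etcdctl ls /_locks/${svc}_migrations
done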
Thanks @stanfra for helping with the detailed steps. I see that 3 services have 5 locks each:
/ # etcdctl ls _locks
/_locks/arcee_migrations
/_locks/bulldozer_migrations
/_locks/gemini_migrations
/_locks/metroculus_migrations
/_locks/restapi_migrations
/_locks/risp_migrations
/_locks/diworker_migrations
/_locks/insider_migrations
/_locks/jira_bus_migrations
/_locks/slacker_migrations
/ # etcdctl ls /_locks/arcee_migrations
/_locks/arcee_migrations/00000000000000001408
/_locks/arcee_migrations/00000000000000001423
/_locks/arcee_migrations/00000000000000000731
/_locks/arcee_migrations/00000000000000000763
/_locks/arcee_migrations/00000000000000000977
/ # etcdctl ls /_locks/bulldozer_migrations
/ # etcdctl ls /_locks/gemini_migrations
/ # etcdctl ls /_locks/metroculus_migrations
/ # etcdctl ls /_locks/restapi_migrations
/ # etcdctl ls /_locks/risp_migrations
/_locks/risp_migrations/00000000000000000735
/_locks/risp_migrations/00000000000000000964
/_locks/risp_migrations/00000000000000000982
/_locks/risp_migrations/00000000000000001201
/_locks/risp_migrations/00000000000000001424
/ # etcdctl ls /_locks/diworker_migrations
/_locks/diworker_migrations/00000000000000000746
/_locks/diworker_migrations/00000000000000000967
/_locks/diworker_migrations/00000000000000001189
/_locks/diworker_migrations/00000000000000001409
/_locks/diworker_migrations/00000000000000001633
/ # etcdctl ls /_locks/insider_migrations
/ # etcdctl ls /_locks/jira_bus_migrations
/ # etcdctl ls /_locks/slacker_migrations
I'll start by deleting all the locks for arcee, then do the same for the other two services, and will update here.
Thanks @stanfra. After removing all the locks, a new arcee migration lock got created and the migration completed successfully after that:
(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl exec -ti etcd-0 sh
/ # etcdctl ls /_locks/arcee_migrations && etcdctl ls /_locks/bulldozer_migrations && etcdctl ls /_locks/gemini_migrations && etcdctl ls /_locks/metroculus_migrations && etcdctl ls /_locks/restapi_migrations && etcdctl ls /_locks/risp_migrations && etcdctl ls /_locks/diworker_migrations && etcdctl ls /_locks/insider_migrations && etcdctl ls /_locks/jira_bus_migrations && etcdctl ls /_locks/slacker_migrations
/_locks/arcee_migrations/00000000000000001639
Now all deployments in the cluster are available. 👍 Will check after an hour if billing imports are done and will share diworker logs if there are issues.
@stanfra There hasn't been a billing import yet. Sharing the log for diworker. Is there any way to find out the progress, or where to check the logs/status for the next import?
ubuntu@ip-10-99-21-212:~$ kubectl get cronjobs | grep -i re
report-import-scheduler-0 */15 * * * * False 0 9m43s 153m
report-import-scheduler-1 0 * * * * False 0 39m 153m
report-import-scheduler-24 0 0 * * * False 0 <none> 153m
report-import-scheduler-6 0 */6 * * * False 0 <none> 153m
resource-discovery-scheduler */5 * * * * False 0 4m43s 153m
resource-observer-scheduler */5 * * * * False 0 4m43s 153m
resource-violations-scheduler */5 * * * * False 0 4m43s 153m
@Shikhar-Nahar, you can trigger the report import job manually using the command kubectl create job --from=cronjobs/report-import-scheduler-1 reportimport. Report import is normally started by the report-import-scheduler-1 cronjob at the beginning of every hour.
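A minimal sketch of triggering the import and watching its progress (the job name reportimport is arbitrary; following the diworker logs is one way to see the actual import work):

# create a one-off job from the hourly cronjob, then check it
kubectl create job --from=cronjobs/report-import-scheduler-1 reportimport
kubectl get job reportimport
kubectl logs job/reportimport

# the import itself is processed by diworker, so follow its logs as well
kubectl logs deployment/diworker -f --tail=100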
Yes, after triggering the import job I can see from the diworker logs that the billing import has started on the upgraded version. However, the billing import failed for one of the GCP cloud account source accounts with the error below.
Logs from diworker around the time when this failed.
This is a different issue, though, so we can track it separately.
@stanfra Thanks for the quick assistance with this.
Describe the bug
Billing imports are not being performed after the cluster is upgraded to the latest version.
To Reproduce
Steps to reproduce the behaviour: upgrade the cluster from 2024012901-public to 2024080501-public as per https://github.com/hystax/optscale?tab=readme-ov-file#cluster-update
Expected behaviour
Billing imports must happen as per the schedule.