hystax / optscale

FinOps, MLOps and cloud cost optimization tool. Supports AWS, Azure, GCP, Alibaba Cloud and Kubernetes.
https://hystax.com
Apache License 2.0

Billing Import not happening after upgrade to latest version #366

Closed · Shikhar-Nahar closed this issue 2 months ago

Shikhar-Nahar commented 2 months ago

Describe the bug
Billing imports are not being performed after the cluster is upgraded to the latest version.

To Reproduce
Steps to reproduce the behaviour:

  1. OptScale cluster with sources added, running version 2024012901-public
  2. Upgrade the cluster to version 2024080501-public as per https://github.com/hystax/optscale?tab=readme-ov-file#cluster-update
  3. Check that source billing imports are no longer happening

Expected behaviour
Billing imports must happen as per schedule.

Screenshots

[screenshots of the billing import status attached]


stanfra commented 2 months ago

Hi @Shikhar-Nahar, please check that all pods are in the "Running" state using the kubectl get pods command. Also check the diworker service logs for the last 24 hours to make sure there are no running migrations. If no migrations are running and some pods are not in the Running state, try deleting those pods and wait until they restart. If this doesn't resolve the issue, run runkube.py with the "-d" flag to delete the cluster. After the pods are deleted, run runkube.py as usual to start the cluster again using the 2024080501-public version.

If the problem persists, please provide the output of the kubectl get pods command.
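
For reference, a minimal sketch of the checks described above, using standard kubectl commands (pod names are placeholders; the exact runkube.py arguments depend on how the cluster was originally deployed):

# check that all pods are in the Running state
kubectl get pods

# inspect the diworker logs for the last 24 hours for running migrations
kubectl logs deployment/diworker --since=24h

# delete a stuck pod so its deployment recreates it
kubectl delete pod <pod-name>

# if that does not help: delete the cluster with runkube.py -d, then start it
# again with the usual runkube.py invocation and the 2024080501-public version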

Shikhar-Nahar commented 2 months ago

@stanfra It seems the bulldozerworker is stuck in init:

bulldozerworker-6db7f5fbb-mxbbp     0/1     Init:3/4     0     18h

All the rest seem to be running.

(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl get deployments
NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
arcee                                               0/1     1            0           18h
auth                                                1/1     1            1           18h
bi-exporter                                         1/1     1            1           18h
booking-observer-worker                             2/2     2            2           18h
bulldozer-api                                       1/1     1            1           18h
bulldozerworker                                     0/1     1            0           18h
bumischeduler                                       1/1     1            1           18h
bumiworker                                          1/1     1            1           18h
calendar-observer-worker                            1/1     1            1           18h
diproxy                                             1/1     1            1           18h
diworker                                            1/1     1            1           18h
error-pages                                         1/1     1            1           18h
etcd-operator-etcd-operator-etcd-backup-operator    0/0     0            0           191d
etcd-operator-etcd-operator-etcd-operator           1/1     1            1           191d
etcd-operator-etcd-operator-etcd-restore-operator   0/0     0            0           191d
gemini-worker                                       1/1     1            1           18h
grafana                                             1/1     1            1           18h
herald-executor                                     1/1     1            1           18h
heraldapi                                           1/1     1            1           18h
heraldengine                                        2/2     2            2           18h
insider-api                                         1/1     1            1           18h
insider-worker                                      1/1     1            1           18h
jira-bus                                            1/1     1            1           18h
jira-ui                                             1/1     1            1           18h
kataraapi                                           1/1     1            1           18h
katarascheduler                                     1/1     1            1           18h
kataraworker                                        1/1     1            1           18h
keeper                                              1/1     1            1           18h
keeper-executor                                     1/1     1            1           18h
live-demo-generator-worker                          1/1     1            1           18h
metroculusapi                                       1/1     1            1           18h
metroculusworker                                    1/1     1            1           18h
myadmin                                             1/1     1            1           18h
ngingress-nginx-ingress-default-backend             1/1     1            1           191d
ngui                                                1/1     1            1           18h
ohsu                                                1/1     1            1           18h
organization-violations-worker                      1/1     1            1           18h
pharos-receiver                                     1/1     1            1           18h
pharos-worker                                       1/1     1            1           18h
power-schedule-worker                               1/1     1            1           18h
resource-discovery-worker                           2/2     2            2           18h
resource-observer-worker                            1/1     1            1           18h
resource-violations-worker                          1/1     1            1           18h
restapi                                             1/1     1            1           18h
risp-worker                                         1/1     1            1           18h
slacker                                             1/1     1            1           18h
slacker-executor                                    1/1     1            1           18h
thanos-query                                        1/1     1            1           18h
trapper-worker                                      1/1     1            1           18h
webhook-executor                                    1/1     1            1           18h

The diworker logs for the last 24 hours have no other messages:

[screenshot of diworker log output attached]

Bulldozer logs:

[screenshot of bulldozer log output attached]

I will try to run runkube.py with -d and restart it with the 2024080501-public version

Shikhar-Nahar commented 2 months ago

The arcee pod is running, however it is marked as not ready:

(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl get pods
NAME                                                         READY   STATUS      RESTARTS   AGE
arcee-5c8948845b-rn7cs                                       0/1     Running     0          18h
auth-5db4b8c9c4-d8xf8                                        1/1     Running     0          18h
bi-exporter-65667f8f79-q4h4r                                 1/1     Running     0          18h
bi-scheduler-1723187700-wpw77                                0/1     Completed   0          2m51s
booking-observer-scheduler-1723187820-tw5c6                  0/1     Completed   0          58s
booking-observer-worker-6457d86766-26wsh                     1/1     Running     0          18h
booking-observer-worker-6457d86766-hmvlf                     1/1     Running     0          18h
bulldozer-api-6bb7f45454-wtzhc                               1/1     Running     0          18h
bulldozerworker-6db7f5fbb-mxbbp                              0/1     Init:3/4    0          18h
bumischeduler-8678c867cf-r9l79                               1/1     Running     0          18h
bumiworker-6c66495694-2s7db                                  1/1     Running     0          18h
.........

Arcee logs:

[screenshot of arcee log output attached]

stanfra commented 2 months ago

Please perform the following steps:

  1. In etcd, check whether there are locks in any of the _locks folders. To connect to etcd you may use the "kubectl exec -ti etcd-0 sh" command.
  2. If locks exist, delete them and check in the logs that the service starts its migration (for example, if you delete a lock in /_locks/arcee_migrations, check the arcee logs).
  3. If a lock has been deleted and the service still does not start the migration, delete that service's pod and check that the migration starts.

If the migration does not start, please provide the service logs.
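
A minimal sketch of this lock check, assuming the etcdctl v2 client available inside the etcd-0 pod and using arcee as the example service (<lock_name> and <arcee-pod-name> are placeholders):

# connect to the etcd pod
kubectl exec -ti etcd-0 sh

# inside the pod: list the lock folders and the locks of one service
etcdctl ls /_locks
etcdctl ls /_locks/arcee_migrations

# delete a stale lock (use the key path printed by the ls command)
etcdctl rm /_locks/arcee_migrations/<lock_name>

# back on the host: if the migration still does not start, restart the pod and follow its logs
kubectl delete pod <arcee-pod-name>
kubectl logs -f deployment/arcee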

stanfra commented 2 months ago

And please keep in mind that after a migration starts, new locks will be created; please do not delete those.

Shikhar-Nahar commented 2 months ago

@stanfra I could not find a _locks directory in the etcd-0 pod:

(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl exec -ti etcd-0 sh
/ # cd /
/ # find . -iname _locks -type d
/ # 

Does this mean there are no locks?

Also sharing the service log for arcee after deleting the cluster and starting it again.

[screenshot of arcee log output attached]

stanfra commented 2 months ago

@Shikhar-Nahar, the _locks entries are etcd keys rather than directories on the pod's filesystem, so find will not show them. After you have connected to etcd, please execute the command etcdctl ls _locks; you will see output like this: https://github.com/user-attachments/assets/baa4604c-fab9-414a-871b-9dbb917b83a8. Then run etcdctl ls _locks/<service_name>_migrations for each service and check the outputs. If the output is empty, there are no locks for that service. If the output is not empty, delete the lock using the command etcdctl rm _locks/<service_name>_migrations/<lock_name>.
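
For convenience, the per-service checks can be combined into one pass; a minimal sketch, assuming etcdctl v2 inside the etcd-0 pod and the lock folder names shown in the reply below:

# list the locks of every migration folder in a single command (etcdctl v2 syntax)
kubectl exec etcd-0 -- sh -c '
for svc in arcee bulldozer gemini metroculus restapi risp diworker insider jira_bus slacker; do
  echo "== ${svc}"
  etcdctl ls /_locks/${svc}_migrations
done'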

Shikhar-Nahar commented 2 months ago

Thanks @stanfra for the detailed steps. I see that 3 services have 5 locks each:

/ # etcdctl ls _locks
/_locks/arcee_migrations
/_locks/bulldozer_migrations
/_locks/gemini_migrations
/_locks/metroculus_migrations
/_locks/restapi_migrations
/_locks/risp_migrations
/_locks/diworker_migrations
/_locks/insider_migrations
/_locks/jira_bus_migrations
/_locks/slacker_migrations
/ # etcdctl ls /_locks/arcee_migrations
/_locks/arcee_migrations/00000000000000001408
/_locks/arcee_migrations/00000000000000001423
/_locks/arcee_migrations/00000000000000000731
/_locks/arcee_migrations/00000000000000000763
/_locks/arcee_migrations/00000000000000000977
/ # etcdctl ls /_locks/bulldozer_migrations
/ # etcdctl ls /_locks/gemini_migrations
/ # etcdctl ls /_locks/metroculus_migrations
/ # etcdctl ls /_locks/restapi_migrations
/ # etcdctl ls /_locks/risp_migrations
/_locks/risp_migrations/00000000000000000735
/_locks/risp_migrations/00000000000000000964
/_locks/risp_migrations/00000000000000000982
/_locks/risp_migrations/00000000000000001201
/_locks/risp_migrations/00000000000000001424
/ # etcdctl ls /_locks/diworker_migrations
/_locks/diworker_migrations/00000000000000000746
/_locks/diworker_migrations/00000000000000000967
/_locks/diworker_migrations/00000000000000001189
/_locks/diworker_migrations/00000000000000001409
/_locks/diworker_migrations/00000000000000001633
/ # etcdctl ls /_locks/insider_migrations
/ # etcdctl ls /_locks/jira_bus_migrations
/ # etcdctl ls /_locks/slacker_migrations

Will start by deleting all the locks for arcee, then proceed with the other 2 services as well, and will update.
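
A minimal sketch of the bulk deletion, assuming etcdctl v2 inside the etcd-0 pod (arcee shown as the example; repeat for the risp and diworker folders, and leave any lock created by a newly started migration in place, as noted above):

# remove every lock key under a given migrations folder
kubectl exec etcd-0 -- sh -c '
for key in $(etcdctl ls /_locks/arcee_migrations); do
  echo "removing ${key}"
  etcdctl rm "${key}"
done'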

Shikhar-Nahar commented 2 months ago

Thanks @stanfra. After removing all the locks, a new arcee migration lock got created and the migration completed successfully after that.

(.venv) ubuntu@ip-10-99-21-212:~/optscale/optscale-deploy$ kubectl exec -ti etcd-0 sh
/ #  etcdctl ls /_locks/arcee_migrations && etcdctl ls /_locks/bulldozer_migrations && etcdctl ls /_locks/gemini_migrations && etcdctl ls /_locks/metroculus_migrations && etcdctl ls /_locks/restapi_migrations && etcdctl ls /_locks/risp_migrations && etcdctl ls /_locks/diworker_migrations && etcdctl ls /_locks/insider_migrations && etcdctl ls /_locks/jira_bus_migrations && etcdctl ls /_locks/slacker_migrations

/_locks/arcee_migrations/00000000000000001639

Now all deployments in the cluster are available. 👍 I will check after an hour whether the billing imports are done and will share the diworker logs if there are issues.

Shikhar-Nahar commented 2 months ago

@stanfra There hasn't been a billing import yet. Sharing the diworker log. Is there any way to find out the progress, or where to check the logs/status for the next import?

[screenshot of diworker log output attached]

ubuntu@ip-10-99-21-212:~$ kubectl get cronjobs | grep -i re
report-import-scheduler-0           */15 * * * *   False     0        9m43s           153m
report-import-scheduler-1           0 * * * *      False     0        39m             153m
report-import-scheduler-24          0 0 * * *      False     0        <none>          153m
report-import-scheduler-6           0 */6 * * *    False     0        <none>          153m
resource-discovery-scheduler        */5 * * * *    False     0        4m43s           153m
resource-observer-scheduler         */5 * * * *    False     0        4m43s           153m
resource-violations-scheduler       */5 * * * *    False     0        4m43s           153m
stanfra commented 2 months ago

@Shikhar-Nahar, you may trigger the report import job manually using the command kubectl create job --from=cronjobs/report-import-scheduler-1 reportimport. The report import is started by the report-import-scheduler-1 cronjob at the beginning of every hour.
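
A minimal sketch of triggering the import manually and following its progress with standard kubectl commands (the job name reportimport is arbitrary):

# create a one-off job from the hourly report-import cronjob
kubectl create job --from=cronjobs/report-import-scheduler-1 reportimport

# watch the job and the diworker logs to follow the import
kubectl get job reportimport
kubectl logs -f deployment/diworker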

Shikhar-Nahar commented 2 months ago

Yes, after triggering the import job, I can see from the diworker logs that the billing import has started on the upgraded version. But for one of the cloud account sources (GCP), the billing import failed with the error below.

[screenshot of the GCP billing import error attached]

Logs from diworker around the time when this failed.

[screenshot of diworker log output attached]

However, this is a different issue, so we can track it separately.

@stanfra Thanks for the quick assistance with this.