hystax / optscale

FinOps, MLOps and cloud cost optimization tool. Supports AWS, Azure, GCP, Alibaba Cloud and Kubernetes.
https://hystax.com
Apache License 2.0

Arcee and BI_Exporter Containers Are Broken #470

Closed GiovanniYuriMita closed 4 hours ago

GiovanniYuriMita commented 1 day ago

Describe the bug: The Arcee and BI_Exporter containers have been in CrashLoopBackOff since the cluster was deployed. Even after upgrading the cluster to the latest version, I still get the same errors.

To Reproduce Steps to reproduce the behavior:

sudo apt update; sudo apt install python3-pip sshpass git python3-virtualenv python3.9 python3.9-venv
git clone https://github.com/hystax/optscale.git
cd optscale/optscale-deploy
virtualenv -p python3.9 venv
source venv/bin/activate
pip install -r requirements.txt
ansible-playbook -e "ansible_connection=local ansible_user=ubuntu" -i "<your-ip>," ansible/k8s-master.yaml
./runkube.py --with-elk  -o overlay/user_template.yml -- cloudcose-deployment 2024110701-public

Logs arcee:

/usr/local/lib/python3.12/site-packages/mongodb_migrations/cli.py:32: SyntaxWarning: invalid escape sequence '\d'
  result = re.match('^(\d+)[_a-z]*\.py$', file)
/usr/local/lib/python3.12/site-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:
* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
  warnings.warn(message, UserWarning)
/usr/local/lib/python3.12/site-packages/pydantic/_internal/_fields.py:128: UserWarning: Field "model_id" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
[2024-11-21 11:33:13 +0000] [1] [INFO] Waiting for migration lock
Found previous migrations, last migration is version: 20240815104530
[2024-11-21 11:33:13 +0000] [1] [INFO] Starting server
[2024-11-21 11:33:13 +0000] [1] [INFO] Sanic v23.12.1
[2024-11-21 11:33:13 +0000] [1] [INFO] Goin' Fast @ http://0.0.0.0:8891
[2024-11-21 11:33:13 +0000] [1] [INFO] app: arcee
[2024-11-21 11:33:13 +0000] [1] [INFO] mode: production, single worker
[2024-11-21 11:33:13 +0000] [1] [INFO] server: sanic, HTTP/1.1
[2024-11-21 11:33:13 +0000] [1] [INFO] python: 3.12.3
[2024-11-21 11:33:13 +0000] [1] [INFO] platform: Linux-5.15.0-1068-aws-x86_64-with-glibc2.36
[2024-11-21 11:33:13 +0000] [1] [INFO] packages: sanic-routing==23.12.0, sanic-ext==23.12.0
Traceback (most recent call last):
  File "/usr/src/app/arcee/arcee_receiver/server.py", line 2694, in <module>
    app.run(host='0.0.0.0', port=8891, access_log=False)
  File "/usr/local/lib/python3.12/site-packages/sanic/mixins/startup.py", line 290, in run
    serve(primary=self)  # type: ignore
    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sanic/mixins/startup.py", line 1032, in serve
    sync_manager = Manager()
                   ^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/local/lib/python3.12/multiprocessing/managers.py", line 562, in start
    self._process.start()
  File "/usr/local/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/local/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/local/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/local/lib/python3.12/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
NotImplementedError: object proxy must define __reduce_ex__()
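
The last frames of this traceback show the failure happening while `multiprocessing` pickles the process object for a spawned child (`popen_spawn_posix.py`, then `reduction.dump`). The message text appears to match the one raised by the `wrapt` library's object proxies, though the traceback does not show which object is involved. As a minimal sketch of the mechanism only, assuming a hypothetical stand-in class `ProxyLike` (not the object actually used in arcee), any object whose `__reduce_ex__` raises will abort pickling in the same way:

```python
import pickle


class ProxyLike:
    """Hypothetical stand-in for an object that refuses to be pickled."""

    def __reduce_ex__(self, protocol):
        # Same message as in the log above; raised here purely for illustration.
        raise NotImplementedError("object proxy must define __reduce_ex__()")


def pickling_error(obj):
    """Return the error raised while pickling obj as text, or None on success."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # broad on purpose: we only report the failure
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    print(pickling_error({"ok": True}))
    print(pickling_error(ProxyLike()))
```

A check like `pickling_error` could be pointed at objects attached to the application before startup to find which one cannot cross a process boundary; that is a debugging idea, not something the thread confirms was done.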

bi-exporter:

Waiting until cluster initialization completed
2024-11-21 13:08:08,009 INFO [optscale_client.config_client.client] [client.py:318] [dd.service=bi-exporter dd.env=prod-bw dd.version=2024110701-public dd.trace_id=0 dd.span_id=0] - Waiting until cluster initialization completed
2024-11-21 13:08:08,089 INFO [root] [main.py:123] [dd.service=bi-exporter dd.env=prod-bw dd.version=2024110701-public dd.trace_id=0 dd.span_id=0] - Starting to consume...
Starting to consume...
Connected to amqp://optscale:**@rabbitmq:5672//
2024-11-21 13:08:08,102 INFO [kombu.mixins] [mixins.py:228] [dd.service=bi-exporter dd.env=prod-bw dd.version=2024110701-public dd.trace_id=0 dd.span_id=0] - Connected to amqp://optscale:**@rabbitmq:5672//
Received message body type: <class 'str'>
Received message body content: {

  key:value

}
2024-11-21 13:08:08,108 INFO [root] [main.py:84] [dd.service=bi-exporter dd.env=prod-bw dd.version=2024110701-public dd.trace_id=0 dd.span_id=0] - Received message body type: <class 'str'>
2024-11-21 13:08:08,108 INFO [root] [main.py:85] [dd.service=bi-exporter dd.env=prod-bw dd.version=2024110701-public dd.trace_id=0 dd.span_id=0] - Received message body content: {

  key:value

}
Traceback (most recent call last):
  File "/usr/src/app/bi_exporter/bumblebi/exporter/main.py", line 143, in <module>
    main(config_cl)
  File "/usr/src/app/bi_exporter/bumblebi/exporter/main.py", line 124, in main
    worker.run()
  File "/usr/local/lib/python3.12/site-packages/kombu/mixins.py", line 174, in run
    for _ in self.consume(limit=None, **kwargs):
  File "/usr/local/lib/python3.12/site-packages/kombu/mixins.py", line 196, in consume
    conn.drain_events(timeout=safety_interval)
  File "/usr/local/lib/python3.12/site-packages/kombu/connection.py", line 341, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kombu/transport/pyamqp.py", line 171, in drain_events
    return connection.drain_events(**kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/amqp/connection.py", line 526, in drain_events
    while not self.blocking_read(timeout):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/amqp/connection.py", line 532, in blocking_read
    return self.on_inbound_frame(frame)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/amqp/method_framing.py", line 77, in on_frame
    callback(channel, msg.frame_method, msg.frame_args, msg)
  File "/usr/local/lib/python3.12/site-packages/amqp/connection.py", line 538, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/amqp/abstract_channel.py", line 156, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.12/site-packages/amqp/channel.py", line 1629, in _on_basic_deliver
    fun(msg)
  File "/usr/local/lib/python3.12/site-packages/kombu/messaging.py", line 656, in _receive_callback
    return on_m(message) if on_m else self.receive(decoded, message)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/kombu/messaging.py", line 622, in receive
    [callback(body, message) for callback in callbacks]
     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/bi_exporter/bumblebi/exporter/main.py", line 87, in process_task
    bi_id = body.get('organization_bi_id')
            ^^^^^^^^
AttributeError: 'str' object has no attribute 'get'
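
The traceback shows `process_task` calling `body.get(...)` on a message body that arrived as a `str`, and the logged body (`key:value` inside braces) is not valid JSON. A minimal sketch of a defensive check, assuming a hypothetical helper `coerce_body` that is not part of the OptScale code base, could reject such messages instead of crashing the consumer:

```python
import json


def coerce_body(body):
    """Return body as a dict, or None if it cannot be interpreted as one."""
    if isinstance(body, dict):
        return body
    if isinstance(body, (str, bytes)):
        try:
            parsed = json.loads(body)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            return None
        # Valid JSON that is not an object (e.g. a list) is also rejected.
        return parsed if isinstance(parsed, dict) else None
    return None


if __name__ == "__main__":
    print(coerce_body('{"organization_bi_id": "abc"}'))
    print(coerce_body("{\n\n  key:value\n\n}"))
```

With a check like this, a handler could log and acknowledge a malformed message rather than raising; whether that is the right policy for bi-exporter is an assumption, not something stated in this thread.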

Additional context Kubectl Pods

kubectl get pods
NAME                                                         READY   STATUS             RESTARTS   AGE
arcee-6557f78d75-hqbtj                                       0/1     CrashLoopBackOff   3646       13d
arcee-cd9c89f66-vk84h                                        0/1     CrashLoopBackOff   18         76m
auth-6db59b6b4c-mv926                                        1/1     Running            0          13d
bi-exporter-6c8568b99d-pf45g                                 0/1     CrashLoopBackOff   10         33m
bi-scheduler-1732196100-rx2vn                                0/1     Completed          0          3m55s
booking-observer-scheduler-1732196280-ctd2n                  0/1     Completed          0          49s
booking-observer-worker-7dfcbfb74d-hpmn5                     1/1     Running            0          13d
booking-observer-worker-7dfcbfb74d-j7ccs                     1/1     Running            0          13d
bulldozer-api-5bf7b49f9b-hw5qh                               1/1     Running            0          13d
bulldozerworker-5549cd77f5-6ldjq                             0/1     Init:4/6           0          13d
bulldozerworker-7d59f5c444-f8t7j                             0/1     Init:4/6           0          13d
bumischeduler-64479c4dd7-nk8cm                               1/1     Running            0          13d
bumiworker-6cd5584b9f-zd4ft                                  1/1     Running            0          13d
calendar-observer-scheduler-1732194000-cmlzc                 0/1     Completed          0          38m
calendar-observer-worker-b8db68bff-rjj58                     1/1     Running            0          13d
cleaninfluxdb-1731801600-hbwr6                               0/1     Completed          0          4d13h
cleanmongodb-1732196160-bkcz9                                0/1     Completed          0          2m44s
clickhouse-0                                                 1/1     Running            0          20d
clickhouse-backup-1731722400-47skt                           0/1     Completed          0          5d11h
clickhouse-backup-1731895200-7lxpm                           0/1     Completed          0          3d11h
clickhouse-backup-1732068000-jv9sq                           0/1     Completed          0          35h
demoorgcleanup-1732147200-d5wmv                              0/1     Completed          0          13h
diproxy-bfdd7bc89-jkxld                                      1/1     Running            0          13d
diworker-5454bf99bb-g5b4n                                    1/1     Running            0          13d
dns-test                                                     0/1     Completed          1          7d23h
elk-0                                                        1/1     Running            2          70d
error-pages-7fb76dc697-dw47f                                 1/1     Running            2          70d
etcd-0                                                       1/1     Running            0          13d
etcd-operator-etcd-operator-etcd-operator-5c8d485cfb-6fctr   1/1     Running            12         2d1h
failed-imports-dataset-generator-1732147200-v5s6j            0/1     Completed          0          13h
gemini-scheduler-1732196100-mh777                            0/1     Completed          0          3m55s
gemini-worker-dc996fdb5-fznm5                                1/1     Running            0          13d
grafana-577bb74654-mlcws                                     2/2     Running            0          13d
herald-executor-6c6c5d7644-95phh                             1/1     Running            0          13d
heraldapi-5d9d6dbc9c-mztdf                                   1/1     Running            0          7d
heraldengine-59dd9764f4-dn6g8                                1/1     Running            0          7d
heraldengine-59dd9764f4-wspjq                                1/1     Running            0          7d
influxdb-0                                                   1/1     Running            0          13d
insider-api-778cc57cff-scrx9                                 1/1     Running            0          13d
insider-scheduler-1732147200-bqlrd                           0/1     Completed          0          13h
insider-worker-6fd6bf99d4-d42gp                              1/1     Running            0          13d
jira-bus-6ddc96f589-6hj86                                    1/1     Running            0          13d
jira-ui-cc875dff6-wkd8c                                      1/1     Running            0          13d
kataraapi-bfdbf756-475jt                                     1/1     Running            0          13d
katarascheduler-645c49b5d5-qn95h                             1/1     Running            0          13d
kataraworker-65869dd8d5-2x2jm                                1/1     Running            0          13d
keeper-9c7c99b8d-lsbp7                                       1/1     Running            0          13d
keeper-executor-6df6565bf5-2gbbf                             1/1     Running            0          13d
layout-cleaner-1726196400-vr7zf                              0/1     Completed          0          69d
live-demo-generator-scheduler-1732194000-9chmd               0/1     Completed          0          38m
live-demo-generator-worker-7df745fc4f-749vq                  1/1     Running            0          13d
mariadb-0                                                    1/1     Running            0          13d
mariadb-backup-1732068000-6m9gf                              0/1     Completed          0          35h
mariadb-backup-manual-5zi01-5zhpg                            0/1     Completed          0          2d22h
mariadb-backup-manual-s29sm-5qmp7                            0/1     Completed          0          2d23h
metroculusapi-544bbfbddc-vr47c                               1/1     Running            0          13d
metroculusscheduler-1732195800-bvpk5                         0/1     Completed          0          8m45s
metroculusworker-b9fd4598d-5lrvl                             1/1     Running            0          13d
minio-0                                                      1/1     Running            1          70d
minio-backup-1731722400-vh8mz                                0/1     Completed          0          5d11h
minio-backup-1731895200-n9g5g                                0/1     Completed          0          3d11h
minio-backup-1732068000-ccl9w                                0/1     Completed          0          35h
mongo-0                                                      1/1     Running            0          13d
mongodb-backup-1731722400-qtfpf                              0/1     Completed          0          5d11h
mongodb-backup-1731895200-w4b57                              0/1     Completed          0          3d11h
mongodb-backup-manual-uhylv-xxmjc                            0/1     Completed          0          2d20h
myadmin-c4dc4c997-rcjrb                                      1/1     Running            1          70d
ngingress-nginx-ingress-controller-vskf5                     1/1     Running            0          68d
ngingress-nginx-ingress-default-backend-678cc96456-7qb7f     1/1     Running            2          68d
ngui-56667b9788-xscj9                                        1/1     Running            0          7d17h
ohsu-798f59d4b-f4jbh                                         1/1     Running            0          13d
organization-violations-scheduler-1732196100-gvvr7           0/1     Completed          0          3m54s
organization-violations-worker-788fb79c99-4rr5n              1/1     Running            0          13d
pharos-receiver-66c895dc7c-j8bl6                             1/1     Running            0          13d
pharos-worker-dc99b7fb8-k4twd                                1/1     Running            0          13d
power-schedule-scheduler-1732196100-8w5cw                    0/1     Completed          0          3m54s
power-schedule-worker-769d6b58c6-v27xr                       1/1     Running            0          13d
pre-configurator-7g5pj                                       0/1     Completed          0          9d
rabbitmq-0                                                   1/1     Running            0          20d
redis-0                                                      1/1     Running            0          13d
report-import-scheduler-0-1732195800-xsvst                   0/1     Completed          0          8m44s
report-import-scheduler-1-1728932400-trxql                   0/1     Completed          0          37d
report-import-scheduler-1-1728936000-2vg88                   0/1     Init:0/1           0          37d
report-import-scheduler-24-1732147200-lp2wz                  0/1     Completed          0          13h
report-import-scheduler-6-1732190400-mksgf                   0/1     Completed          0          98m
resource-discovery-scheduler-1732196100-58g5n                0/1     Completed          0          3m53s
resource-discovery-worker-54bd849796-dfgbg                   1/1     Running            9          13d
resource-discovery-worker-54bd849796-wjrhc                   1/1     Running            8          13d
resource-observer-scheduler-1730465100-4n5l5                 0/1     Completed          0          20d
resource-observer-scheduler-1730465400-vl229                 0/1     Init:0/1           0          20d
resource-observer-worker-8b4c958f4-p7gnw                     1/1     Running            0          13d
resource-violations-scheduler-1732196100-7lwzf               0/1     Completed          0          3m53s
resource-violations-worker-7cbb959dcc-htk6q                  1/1     Running            0          13d
restapi-5b45894555-zcvd6                                     1/1     Running            0          7d16h
risp-scheduler-1732067100-gnh89                              0/1     Completed          0          35h
risp-scheduler-1732070700-6gt4d                              0/1     Init:2/3           0          34h
risp-worker-7ccdb86f5d-v85lz                                 1/1     Running            0          13d
slacker-executor-8545d6df77-zfmnw                            1/1     Running            0          13d
slacker-fbd96967f-hqsps                                      1/1     Running            0          13d
thanos-compactor-1732190400-7vr8x                            0/1     Completed          0          98m
thanos-query-8776c6dcc-qfmk4                                 1/1     Running            1          70d
thanos-receive-0                                             1/1     Running            1          70d
thanos-storegateway-0                                        1/1     Running            2          70d
thanos-web-0                                                 1/1     Running            1          70d
trapper-scheduler-1732193100-f2n86                           0/1     Completed          0          53m
trapper-worker-5fd546588c-r79bj                              1/1     Running            0          13d
users-dataset-generator-1732147200-vvs7p                     0/1     Completed          0          13h
webhook-executor-5c55487bd6-p7tbq                            1/1     Running            0          13d
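
In a listing this long, the failing pods are easy to miss. As a rough sketch, assuming the default column layout of `kubectl get pods` shown above (NAME, READY, STATUS, RESTARTS, AGE) and a hypothetical helper named `unhealthy_pods`, the output can be filtered down to pods whose status is neither `Running` nor `Completed`:

```python
HEALTHY = {"Running", "Completed"}


def unhealthy_pods(kubectl_output):
    """Return (name, status) pairs for pods whose STATUS is not healthy."""
    result = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 5 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if status not in HEALTHY:
            result.append((name, status))
    return result


SAMPLE = """\
NAME                                READY   STATUS             RESTARTS   AGE
arcee-6557f78d75-hqbtj              0/1     CrashLoopBackOff   3646       13d
auth-6db59b6b4c-mv926               1/1     Running            0          13d
bulldozerworker-5549cd77f5-6ldjq    0/1     Init:4/6           0          13d
cleanmongodb-1732196160-bkcz9       0/1     Completed          0          2m44s
"""

if __name__ == "__main__":
    for name, status in unhealthy_pods(SAMPLE):
        print(name, status)
```

This only looks at the STATUS column, so a pod that is `Running` but not ready (for example `0/1`) would not be reported.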

etcd-0 locks:

sh: bash: not found
/ # etcdctl ls _locks
/_locks/arcee_migrations
/_locks/diworker_migrations
/_locks/jira_bus_migrations
/_locks/bulldozer_migrations
/_locks/gemini_migrations
/_locks/insider_migrations
/_locks/metroculus_migrations
/_locks/restapi_migrations
/_locks/risp_migrations
/_locks/slacker_migrations
/ # etcdctl ls _locks/arcee_migrations
/ #

tm-hystax commented 7 hours ago

Hi, I followed the provided steps and did not encounter your error! All pods are in the correct states.

Please try to restart your cluster (your data won't be lost).

  1. Remove cluster

    ./runkube.py -d -- cloudcose-deployment 2024110701-public

  2. Start cluster

    ./runkube.py --with-elk -o overlay/user_template.yml -- cloudcose-deployment 2024110701-public
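
The two steps above can be sketched as a small wrapper. This is only an illustration: the function names `restart_commands` and `restart` are hypothetical, the flags are copied from the commands in this comment, and the script is assumed to be run from the `optscale-deploy` directory. By default it prints the commands without executing anything.

```python
import subprocess


def restart_commands(deployment, version, overlay="overlay/user_template.yml"):
    """Build the remove and start commands as argument lists."""
    remove = ["./runkube.py", "-d", "--", deployment, version]
    start = ["./runkube.py", "--with-elk", "-o", overlay, "--", deployment, version]
    return [remove, start]


def restart(deployment, version, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in restart_commands(deployment, version):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop if a step fails


if __name__ == "__main__":
    restart("cloudcose-deployment", "2024110701-public")
```

Passing `dry_run=False` runs the removal first and stops if it fails, so the start step is never attempted against a half-removed cluster.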

GiovanniYuriMita commented 4 hours ago

Hi @tm-hystax,

Thank you so much for the help! I really don't know what was causing those issues, but this simple guide solved them.

Thank you so much, guys!

Wish you all the best!