ibm-mas / ansible-devops

Ansible collection supporting devops for IBM Maximo Application Suite
https://ibm-mas.github.io/ansible-devops/
Eclipse Public License 2.0

[Watson Discovery][ROKS Cluster] WD ranker rest pod unable to start #1280

Closed rene-oromtz closed 2 months ago

rene-oromtz commented 3 months ago

Summary

The wd-discovery-ranker-rest pod keeps entering the CrashLoopBackOff state, and WD stays in InProgress status. Inside the pod, the log shows a JIT compiler crash with vmState=0x00040000.

Steps to reproduce

  1. Install CP4D from IBM Maximo curated catalog
  2. Install Watson Discovery following Maximo procedure.
  3. Wait for all deployments to complete
  4. wd-discovery-ranker-rest is the only deployment that does not complete

What is the current bug behavior?

WD reports the Ready state on the ROKS cluster, but not all deployments are healthy.

What is the expected correct behavior?

WD should be Ready and all deployments should have the required pods available.

Relevant logs and/or screenshots

CPD Version:

$ oc get ibmcpd -n ibm-cpd -o yaml|grep version:
    version: 4.6.6

WD

$ oc get wd wd -n ibm-cpd
NAME   VERSION   READY   READYREASON   UPDATING   UPDATINGREASON   DEPLOYED   VERIFIED   QUIESCE        DATASTOREQUIESCE   AGE
wd     4.6.5     True    Stable        True       VerifyWait       22/22      21/22      NOT_QUIESCED   NOT_QUIESCED       5h44m
$ oc get csv -n ibm-cpd-operators|egrep "discovery|NAME"
NAME                                      DISPLAY                                VERSION    REPLACES                                  PHASE
ibm-watson-discovery-operator.v5.5.0      Watson Discovery                       5.5.0                                                Succeeded
$ oc get sub -n ibm-cpd-operators|egrep "wd|NAME"
NAME                                                                           PACKAGE                         SOURCE                 CHANNEL
cpd-wd-operator                                                                ibm-watson-discovery-operator   ibm-operator-catalog   v5.5

WD Deployed Components:

$ oc get wd wd -n ibm-cpd -o yaml
...
    deployedComponents:
    - api
    - certmanager
    - cnm
    - commonconfig
    - coreapi
    - edbcnpostgres
    - elasticsearchcluster
    - etcd
    - foundation
    - hdp
    - ingestion
    - miniocluster
    - orchestrator
    - query
    - rabbitmqcluster
    - rbac
    - sdu
    - statelessapi
    - tooling
    - watsongateway
    - wire
    - wksml
    failedComponents: []
    unverifiedComponents:
    - wire
    verified: 21/22
...

WatsonDiscoveryWireCR:

$ oc get WatsonDiscoveryWire wd -n ibm-cpd -o yaml
...
    deployedComponents:
    - haywire
    - jks_secret
    - postgres_secret_gen
    - project_data_prep_agent
    - ranker_master
    - ranker_monitor_agent
    - ranker_rest
    - serve_ranker
    - training_agents
    - training_crud
    - training_rest
    - wireconfig
    failedComponents: []
    unverifiedComponents:
    - ranker_rest
    verified: 11/12
...

WD Ranker Rest Image:

$ oc describe deploy wd-discovery-ranker-rest -n ibm-cpd | grep -i image
    Image:      cp.icr.io/cp/watson-discovery/wd-utils:14.6.4-63@sha256:c295fa3caaf166566d37a579f8e81b1b880b990359619685f4249f60dc332a1e
    Image:      cp.icr.io/cp/watson-discovery/discovery-ranker-rest-service:20230407-002358-4-b68be44@sha256:f9ef8d407d5de6c0d1bde5e3b0ed0b99fbfa36a392c0c5c8b7d2553a2297b30c

WD Ranker Rest Logs: full log: wd-discovery-ranker-rest-7c8894444c-l67mn-wd-discovery-ranker-rest.log

Unhandled exception
Type=Segmentation error vmState=0x00040000
J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
Handler1=00007F3C24E10830 Handler2=00007F3C246BCCD0 InaccessibleAddress=00007F3BE1BC2C80
RDI=0000000000014200 RSI=0000000000C64130 RAX=0000000000000000 RBX=0000000000000010
RCX=00007F3C22396D90 RDX=0000000000000000 R8=00007F3C2001B990 R9=0000000000000000
R10=0000000000000000 R11=00007F3BE1BC2C80 R12=0000000000000001 R13=0000000000000008
R14=00007F3C25D51D00 R15=00007F3C24EDBAF4
RIP=00007F3BE1BC2C80 GS=0000 FS=0000 RSP=00007F3C25D51B18
EFlags=0000000000010246 CS=0033 RBP=00007F3C25D51B20 ERR=0000000000000014
TRAPNO=000000000000000E OLDMASK=0000000000000000 CR2=00007F3BE1BC2C80
xmm0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm1 6e6f697372655672 (f: 1919243904.000000, d: 9.083668e+223)
xmm2 00007f3c25d51fd0 (f: 634724288.000000, d: 6.911796e-310)
xmm3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm4 43e0000000000000 (f: 0.000000, d: 9.223372e+18)
xmm5 000000003dae8a3b (f: 1034848832.000000, d: 5.112833e-315)
xmm6 0000000048bb7560 (f: 1220244864.000000, d: 6.028811e-315)
xmm7 000000000003a660 (f: 239200.000000, d: 1.181805e-318)
xmm8 0000000000c69ca0 (f: 13016224.000000, d: 6.430869e-317)
xmm9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm11 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
xmm12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
xmm13 0000000047330713 (f: 1194526464.000000, d: 5.901745e-315)
xmm14 00000000490f0490 (f: 1225720960.000000, d: 6.055866e-315)
xmm15 000000004764d420 (f: 1197790208.000000, d: 5.917870e-315)
Target=2_90_20230313_47323 (Linux 4.18.0-477.27.1.el8_8.x86_64)
CPU=amd64 (32 logical CPUs) (0x1f78997000 RAM)

WD deployments:

$ oc get deployment -n ibm-cpd|egrep "wd-|NAME"
NAME                                       READY   UP-TO-DATE   AVAILABLE   AGE
wd-discovery-cnm-api                       1/1     1            1           5h19m
wd-discovery-converter                     1/1     1            1           5h57m
wd-discovery-crawler                       1/1     1            1           5h57m
wd-discovery-entity-suggestion             1/1     1            1           5h19m
wd-discovery-entity-training               1/1     1            1           5h19m
wd-discovery-gateway                       1/1     1            1           4h11m
wd-discovery-glimpse-builder               1/1     1            1           5h17m
wd-discovery-glimpse-query                 1/1     1            1           5h19m
wd-discovery-haywire                       1/1     1            1           5h14m
wd-discovery-hdp-rm                        1/1     1            1           5h15m
wd-discovery-ingestion-api                 1/1     1            1           5h15m
wd-discovery-inlet                         1/1     1            1           5h57m
wd-discovery-management                    1/1     1            1           5h15m
wd-discovery-minerapp                      1/1     1            1           4h18m
wd-discovery-orchestrator                  1/1     1            1           5h57m
wd-discovery-outlet                        1/1     1            1           5h57m
wd-discovery-po-box                        1/1     1            1           5h57m
wd-discovery-project-data-prep-agent       1/1     1            1           5h14m
wd-discovery-ranker-master                 1/1     1            1           5h14m
wd-discovery-ranker-monitor-agent          1/1     1            1           5h14m
wd-discovery-ranker-rest                   0/1     1            0           5h14m
wd-discovery-rapi                          1/1     1            1           5h15m
wd-discovery-serve-ranker                  1/1     1            1           5h14m
wd-discovery-stateless-api-model-runtime   1/1     1            1           5h57m
wd-discovery-stateless-api-rest-proxy      1/1     1            1           5h57m
wd-discovery-support                       0/0     0            0           5h15m
wd-discovery-tooling                       1/1     1            1           4h18m
wd-discovery-training-agents               1/1     1            1           5h14m
wd-discovery-training-crud                 1/1     1            1           5h14m
wd-discovery-training-rest                 1/1     1            1           5h14m
wd-discovery-watson-gateway-gw-instance    1/1     1            1           4h8m
wd-discovery-wd-indexer                    1/1     1            1           5h57m
wd-discovery-wksml                         1/1     1            1           5h57m
wd-minio-discovery-kes                     1/1     1            1           5h52m
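
To pick the unhealthy deployment out of a long listing like this one, a small filter can help. This is only a sketch: `unready_deployments` is a hypothetical helper name, and it assumes the default `oc get deployment` table layout shown above (NAME first, then READY as ready/desired).

```shell
# Hypothetical helper: print the names of deployments whose ready count
# differs from the desired count. Assumes the default table layout of
# `oc get deployment` (column 1 = NAME, column 2 = READY as "ready/desired").
unready_deployments() {
  awk '$1 != "NAME" { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Usage (not run here):
#   oc get deployment -n ibm-cpd | unready_deployments
```

A deployment scaled to zero, such as wd-discovery-support at 0/0, is not reported, because its ready and desired counts match.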

This issue seems to affect only ROKS clusters.

This issue was originally opened against the Cloud Pak for Data (CPD) team. However, the CPD team found that the problem was with the ranker-rest image, and dev was able to fix it by patching the image to the latest one for 4.6.6:

oc patch wd wd --type=merge --patch='{"spec": {"wire": {"rankerRest": {"image": {"tag":"20240223-012927-645-b14d8a8","digest":"sha256:89e2c3efffbb06eb1605fb6b3b550ca7e0a41bc88f9e0a0d78c5727a54ff9635"}}}}}'
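
If the tag or digest needs to change later, the patch body can be built from variables instead of being edited by hand. A minimal sketch, assuming the same JSON structure as the command above; `ranker_rest_patch` is a hypothetical helper name, not part of any IBM tooling.

```shell
# Hypothetical helper: build the merge-patch JSON for the ranker-rest image
# from a tag ($1) and a digest ($2), in the same shape as the command above.
ranker_rest_patch() {
  printf '{"spec": {"wire": {"rankerRest": {"image": {"tag":"%s","digest":"%s"}}}}}' "$1" "$2"
}

# Usage (not run here):
#   oc patch wd wd --type=merge --patch="$(ranker_rest_patch "$TAG" "$DIGEST")"
```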

The CPD team indicated that this workaround would not be documented in the CPD docs, because the issue is not reproducible on other clusters when using the CPD install procedure, so it might be directly related to Maximo and the images used during installation.

Since the correct ranker-rest image has already been provided, I'm wondering whether this workaround can be documented on the Maximo side and where the best place for it would be. The Maximo Assist Troubleshooting section might be suitable, as this was encountered during Assist configuration.

durera commented 3 months ago

We will publish a KT for this, "Known issue with wd-discovery-ranker-rest in CPD 4.6.6" or some such. Watson Discovery has been problematic since its introduction to the dependency stack, and is due to be removed from it in an upcoming release.

istrate commented 2 months ago

DT and Techdoc available at: https://www.ibm.com/support/pages/node/7149820

rene-oromtz commented 2 months ago

Thanks for the update @istrate @durera!

rene-oromtz commented 2 months ago

@istrate I think I overlooked something. Is it just me, or is the patch cropped in https://www.ibm.com/support/pages/node/7149820? Not sure if it's my browser, but the patch looks as follows:

oc patch wd wd --type=merge --patch='{"spec": {"wire": {"rankerRest": {"image": {"tag":"20240223-012927-645-b14d8a8","digest":"sha256:89e2c3efffbb06eb1605fb6b3b550ca7e0a41bc88f9e0a0d78c5

However, in the known issue DT381121 I do see the full patch:

oc patch wd wd --type=merge --patch='{"spec": {"wire": {"rankerRest": {"image": {"tag":"20240223-012927-645-b14d8a8","digest":"sha256:89e2c3efffbb06eb1605fb6b3b550ca7e0a41bc88f9e0a0d78c5727a54ff9635"}}}}}'
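
A cropped digest like the one in the truncated command can be caught before running the patch. The check below is a sketch: `is_full_sha256_digest` is a hypothetical helper name, and it relies only on the fact that a SHA-256 digest is 64 hexadecimal characters.

```shell
# Hypothetical check: succeed only if the argument is "sha256:" followed by
# exactly 64 lowercase hexadecimal characters (a complete SHA-256 digest).
is_full_sha256_digest() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Usage:
#   is_full_sha256_digest "$DIGEST" || echo "digest looks truncated" >&2
```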

istrate commented 2 months ago

@rene-oromtz Thanks! I have updated that with the correct patch command.