ibm-mas / ansible-devops

Ansible collection supporting devops for IBM Maximo Application Suite
https://ibm-mas.github.io/ansible-devops/
Eclipse Public License 2.0

Enhance CP4D playbooks to fix critical performance issues #239

Closed lokeshs09 closed 2 years ago

lokeshs09 commented 2 years ago

Opening an issue here to make the necessary fixes or enhancements to the cp4d playbooks in ansible-devops. The recent breakdowns and blockers on IVT10 and IVT11 were caused by 2 main factors.

Details of the issue are documented here: https://github.ibm.com/PrivateCloud-analytics/CPD-Quality/issues/2326

Copying Sriram's recommendations here:

Summary of the issues:

1) Performance issue: suspected storage IOPS limits (with ibmc-file-gold-gid) or perhaps connection leaks. One cluster performs better after scale out/up, but the second cluster still has problems (@rahul-shinge & @kvstumph are reviewing the second cluster).

2) Issue with using the latest "ibm-operator-catalog": this can cause uncontrolled automatic upgrades and will be hard for CPD to support if there is an arbitrary mix-and-match of versions (including the Cloud Pak Foundational Services version).


(future) Action Items:

1) Right storage selection (especially on IBM Cloud) to improve reliability

When provisioning the Ibmcpd CR, add zenCoreMetadbStorageClass and point it to a block storage class:

apiVersion: cpd.ibm.com/v1
kind: Ibmcpd
metadata:
  name: ibmcpd-cr
  namespace: cpd-instance
spec:
  license:
    accept: true
    license: Enterprise
  storageClass: <rwx-storage-such-as-ibmc-file-gold-gid>
  zenCoreMetadbStorageClass: <block-storage>

2) Validating performance of the available storage classes: CPD now also has tools (published as open source) to measure/benchmark the target storage. See: https://github.com/IBM/k8s-storage-perf

3) Freeze the CP4D version so that it does not get upgraded unexpectedly whenever a refresh happens. This matters for stability, and it is important to be on a “validated” version combination.

Use fixed catalog sources (image digests) instead of ibm-operator-catalog: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.0?topic=ccs-creating-catalog-sources-that-pull-specific-versions-images-from-entitled-registry
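As an illustration of pinning by digest, a catalog source could look roughly like the sketch below. The metadata name, display name, and the image placeholder are assumptions for illustration only; the validated image references are the ones listed in the linked CPD documentation.

```yaml
# Sketch only: a CatalogSource pinned to an image digest instead of a moving tag.
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: ibm-cpd-pinned-catalog          # hypothetical name
  namespace: openshift-marketplace
spec:
  displayName: Cloud Pak for Data (pinned)   # hypothetical display name
  publisher: IBM
  sourceType: grpc
  # Placeholder: substitute the repository and sha256 digest of the validated catalog image.
  image: <catalog-image-repository>@sha256:<digest>
  # No updateStrategy.registryPoll is set, so OLM does not poll for a newer catalog image.
```

Because a digest is immutable, the catalog content only changes when this resource is edited deliberately.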

CPD is introducing additional automation to reduce complexity of installs and upgrades while pinning versions: https://github.ibm.com/PrivateCloud/olm-utils

4) Use LDAP/AD even for testing environments (or Cloud Pak IAM) to mimic “enterprise” security — the out-of-the-box placeholder is not secure enough or recommended for use. Once Authentication is configured, most customers turn off even the "admin" user: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.0?topic=users-disabling-default-admin-user (using the out-of-the-box usermgmt is ok only for dev/test purposes)

lokeshs09 commented 2 years ago

Comments from Rahul

I think this cluster has resource issues

  1. Changed to use the ibmc-block-gold storage class with scale=medium: DB response was slow.
  2. Changed limits for metastoredb to CPU=2 / Memory=4GB: DB response is improved but not fast. The page loads and shows projects and other artifacts.
  3. ibm-zen-operator is in maintenance mode (ignoreForMaintenance: true).
  4. Lite-cr is updated to use block storage (zenCoreMetaDbStorageClass: ibmc-block-gold).

lokeshs09 commented 2 years ago

Link to Slack conversation: https://ibm-watson-iot.slack.com/archives/C0195MVCEUD/p1648836704013219

andrercm commented 2 years ago

I fixed the zen-metastoredb storage class to be ibmc-block-gold, but what actually seemed to resolve the problem was boosting the default zen-metastoredb statefulset to use more memory/CPU: https://github.com/ibm-mas/ansible-devops/blob/master/ibm/mas_devops/roles/cp4d_install/tasks/install/cpd40.yml#L123
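For readers unfamiliar with this kind of change, a resource boost on the statefulset might be expressed as a patch fragment like the sketch below. The values are the ones Rahul mentioned above (CPU=2 / Memory=4GB), not necessarily what the linked playbook task applies. The namespace is taken from the Ibmcpd example earlier in this issue, and the container name is a placeholder.

```yaml
# Sketch only: raise the limits on the zen-metastoredb statefulset.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zen-metastoredb
  namespace: cpd-instance               # assumption: namespace from the Ibmcpd example
spec:
  template:
    spec:
      containers:
        - name: <metastoredb-container-name>   # placeholder, not confirmed in this thread
          resources:
            limits:
              cpu: "2"
              memory: 4Gi
```

The owning operator may revert manual edits like this when it reconciles, which is presumably why the zen operator was put into maintenance mode in the comment above.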

I still have open questions about the need to set the CPD installs to manual instead of automatic upgrades. This will likely cause trouble in more places in the ansible collection, because if we set CPD to manual upgrades, all subscriptions under ibm-common-services will be forced to be manually managed as well.
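For context, the manual setting under discussion is the installPlanApproval field on an OLM Subscription. A minimal sketch, with every angle-bracket value a placeholder rather than a real CPD package or channel name:

```yaml
# Sketch only: an OLM Subscription that requires manual approval of install plans.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: <operator-subscription-name>    # placeholder
  namespace: ibm-common-services
spec:
  name: <operator-package-name>         # placeholder
  channel: <channel>                    # placeholder
  source: <catalog-source-name>         # placeholder
  sourceNamespace: openshift-marketplace
  installPlanApproval: Manual           # upgrades wait until the InstallPlan is approved
```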

durera commented 2 years ago

Auto-upgrade does not affect the cp4d product version (4.x); only the operator versions are affected by OLM subscriptions, so we can close this based on the work @andrercm has already performed.