ibm-messaging / mq-container

Container images for IBM® MQ
Apache License 2.0

gskcapicmd_64 100% cpu usage and freeze start mq container on AMD CPUs #462

Open kalekhin opened 3 years ago

kalekhin commented 3 years ago

Hello!

When my laptop is connected to a power source, the gskcapicmd_64 call never finishes and one thread uses 100% of a CPU core. The problem goes away when the laptop runs on battery. I don't understand why, or how the two are related.

Reproduced on ROG Zephyrus G15 GA503 GA503QM-HN094 (AMD Ryzen 7 5800HS with Radeon). Env: Windows 10 Pro 20H2 build 19042.1110 with all updates as of 17/07/2021; Docker Toolbox v19.03.1 running on VirtualBox 6.1.22 r144080 (Qt5.6.2) with the Extension Pack 6.1.22 r144080.

Just run:

docker run \
--env LICENSE=accept \
--env MQ_QMGR_NAME=QM1 \
--publish 1414:1414 \
--publish 9443:9443 \
--detach \
ibmcom/mq

And the bug will happen.

https://github.com/ibm-messaging/mq-container/blob/4580cecf4973107dff184e8cbbcf9ac7f5b4e7df/internal/keystore/keystore.go#L192

kalekhin commented 3 years ago

The problem goes away if I give the Linux virtual machine that runs Docker all of the processor cores, or if I set that machine's paravirtualization interface to Hyper-V.

With the first solution, CPU load still reaches 100%, but now it comes from the java process in the MQ container. Restarting the container then has a chance (only a chance) of clearing the problem. The second solution looks more stable; so far there have been no problems.

I am not sure that any solution is stable.
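
For readers who prefer the command line, the two VirtualBox changes described in this comment could be applied roughly as sketched below. This is an illustration only, not something the poster ran: it assumes the Docker Toolbox VM has docker-machine's usual name "default", that the VirtualBox 6.1 flag names apply, and that 8 is a suitable core count for the host (CPU_COUNT and VM_NAME are placeholders). By default the script only prints the commands; setting APPLY=1 would execute them.

```shell
#!/bin/sh
# Sketch of the two workarounds from the comment above.
# Assumptions: VM name "default", VirtualBox 6.1 option names, 8 host cores.
VM_NAME="${VM_NAME:-default}"
CPU_COUNT="${CPU_COUNT:-8}"

# Print each command unless APPLY=1 is set, so the script is safe to try.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Workaround 1: give the VM more processor cores.
set_all_cores() {
    run docker-machine stop "$VM_NAME"    # VM settings change only while powered off
    run VBoxManage modifyvm "$VM_NAME" --cpus "$CPU_COUNT"
    run docker-machine start "$VM_NAME"
}

# Workaround 2: switch the paravirtualization interface to Hyper-V.
set_hyperv_paravirt() {
    run docker-machine stop "$VM_NAME"
    run VBoxManage modifyvm "$VM_NAME" --paravirtprovider hyperv
    run docker-machine start "$VM_NAME"
}

# The poster found the second workaround the more stable of the two.
set_hyperv_paravirt
```

Either function can be used on its own; the thread does not establish that either one is a reliable fix.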

jfmatheusg commented 3 years ago

Same problem here :( I'm running the container on Ubuntu 20.04 LTS, AMD Ryzen 7 5800, Lenovo Legion 5 Pro.

mihmig commented 3 years ago

I have a similar situation: 100% CPU load from gsk8capicmd_64. My hardware/software setup: AMD Ryzen 7 5800H; Windows 10.0.19042.1237 with WSL 2, kernel version 5.10.16; Docker 20.10.8 build 3967b7d.

andreysaksonov commented 2 years ago

AMD Ryzen 7 5800H (Lenovo Legion 5) Fedora 36 (kernel 5.17.12-300.fc36.x86_64), Docker Desktop 4.9.0 (docker 20.10.16)

Same issue: /opt/mqm/gskit8/bin/gsk8capicmd_64 -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw JtM27EP7L2LH -stash drives the CPU to 100% and takes anywhere from 6 to 40 minutes, seemingly at random.

@arthurbarr could you escalate this? This is a productivity killer for developers working on AMD Ryzen setups. I guess the original issue comes from some GSKit8 bug.

parrobe commented 2 years ago

Hi @andreysaksonov, @mihmig, @jfmatheusg, @kalekhin.

Arthur has asked me to look into this. gsk8capicmd is owned by an internal IBM team that is separate from IBM MQ, but I can raise a support ticket asking them to take a look. To help them diagnose the issue, they are likely to want a trace of it.

Please could you run the same command that caused the 100% CPU issue, adding the -trace <file> option. For example, taking @andreysaksonov's command, I would run: /opt/mqm/gskit8/bin/gsk8capicmd_64 -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw JtM27EP7L2LH -stash -trace /tmp/trace.output

Please then send me the generated file; in the example above I would need /tmp/trace.output. You can either attach the file to a comment here or email it directly to parrobe@uk.ibm.com.

Could you also let me know which version of MQ you are using? This is best done with the dspmqver command. You can also tell me which version of GSKit you are using with dspmqver -p 65.

In the meantime I'll get the ball rolling with GSKit and hopefully we can get to the bottom of this.

andreysaksonov commented 2 years ago

@parrobe

docker rm ibmmq && docker run -e LICENSE=accept -e DEBUG=true -e MQ_QMGR_NAME=QM1 -p 9443:9443 --name ibmmq icr.io/ibm-messaging/mq:latest

❯ docker exec -it ibmmq /bin/bash
bash-4.4$ ps auxwwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1001        49  0.1  0.0  35120  4284 pts/0    Ss   14:45   0:00 /bin/bash
1001        55  0.0  0.0  47616  3544 pts/0    R+   14:45   0:00  \_ ps auxwwf
1001         1  0.5  0.1 1290456 14472 ?       Ssl  14:45   0:00 runmqserver -nologruntime -dev
1001        43  0.0  0.0  34988  4068 ?        S    14:45   0:00 /bin/sh /opt/mqm/bin/runmqakm -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw zxDekBysHQ7S -stash
1001        48  100  0.0  46816 11652 ?        R    14:45   0:15  \_ /opt/mqm/gskit8/bin/gsk8capicmd_64 -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw zxDekBysHQ7S -stash
bash-4.4$ /bin/sh /opt/mqm/bin/runmqakm -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw zxDekBysHQ7S -stash -trace /tmp/zxDekBysHQ7S.trace
bash-4.4$ date
Tue Jun  7 14:46:05 UTC 2022
bash-4.4$ /bin/sh /opt/mqm/bin/runmqakm -keydb -create -type cms -db /run/runmqserver/tls/key.kdb -pw zxDekBysHQ7S -stash -trace /tmp/zxDekBysHQ7S.trace
CTGSK3036W The output file "/run/runmqserver/tls/key.kdb" already exists.

bash-4.4$ date
Tue Jun  7 14:46:22 UTC 2022
bash-4.4$ exit
exit
❯ docker cp ibmmq:/tmp/zxDekBysHQ7S.trace .
❯ docker exec -it ibmmq /bin/bash
bash-4.4$ dspmqver
Name:        IBM MQ
Version:     9.2.5.0
Level:       p925-L220207-CSU01-L220405.DE
BuildType:   IKAP - (Production)
Platform:    IBM MQ for Linux (x86-64 platform)
Mode:        64-bit
O/S:         Linux 5.10.104-linuxkit
O/S Details: Red Hat Enterprise Linux 8.6 (Ootpa)
InstName:    Installation1
InstDesc:    IBM MQ V9.2.5.0 (Unzipped)
Primary:     N/A
InstPath:    /opt/mqm
DataPath:    /mnt/mqm/data
MaxCmdLevel: 925
LicenseType: Developer
bash-4.4$ dspmqver -p 65
Name:        IBM MQ
Version:     9.2.5.0
Level:       p925-L220207-CSU01-L220405.DE
BuildType:   IKAP - (Production)
Platform:    IBM MQ for Linux (x86-64 platform)
Mode:        64-bit
O/S:         Linux 5.10.104-linuxkit
O/S Details: Red Hat Enterprise Linux 8.6 (Ootpa)
InstName:    Installation1
InstDesc:    IBM MQ V9.2.5.0 (Unzipped)
Primary:     N/A
InstPath:    /opt/mqm
DataPath:    /mnt/mqm/data
MaxCmdLevel: 925
LicenseType: Developer

AMQ8250I: The 32-bit GSKit component is not installed.

Name:        IBM Global Security Kit for IBM MQ
Version:     8.0.55.26
BuildType:   Production
Mode:        64-bit
bash-4.4$ 

zxDekBysHQ7S.zip

As you can see, the drama of the situation is that the command only hangs when it is spawned by runmqserver -nologruntime -dev; when I run it from a new shell in the container, it does not hang. I've attached the trace file anyway.

parrobe commented 2 years ago

Thanks @andreysaksonov - I've passed these details onto the GSKit team. I will let you know when they have responded.

parrobe commented 2 years ago

Hi @andreysaksonov - We've heard back from GSKit now. They suspect this is an issue they have seen with some AMD processors, where their RNG module hangs due to a difference in the AMD chip's clock. As a workaround, they have asked us to retry with the following environment variable set to see if the issue resolves: please set ICC_SHIFT=3 when creating your container so that it is present at container startup. Please run the trace again if the issue has not resolved.

WARNING: If you are reading this issue because you see similar symptoms, please do not set this environment variable unless specifically advised to do so. While the variable may resolve the issue, it can negatively impact performance or functionality if the problem it targets is not the cause of your particular problem.
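
As a minimal sketch of what "set ICC_SHIFT=3 when creating your container" could look like, the script below takes the docker run command from the original report and adds the one variable. The image name comes from a later comment in this thread; the run helper and the APPLY switch are illustrative additions, and the warning above still applies. The script prints the command by default and only starts a container when APPLY=1 is set.

```shell
#!/bin/sh
# Sketch: pass ICC_SHIFT=3 at container creation so it is present at startup.
# Only use this if you have been advised to (see the warning above).
IMAGE="${IMAGE:-icr.io/ibm-messaging/mq:latest}"

# Print the command unless APPLY=1 is set.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

start_mq_with_icc_shift() {
    # Same options as the original reproduction, plus ICC_SHIFT=3.
    run docker run \
        --env LICENSE=accept \
        --env MQ_QMGR_NAME=QM1 \
        --env ICC_SHIFT=3 \
        --publish 1414:1414 \
        --publish 9443:9443 \
        --detach \
        "$IMAGE"
}

start_mq_with_icc_shift
```

Setting the variable on docker run, rather than exporting it later inside the container, matters here because the hanging key database command is spawned by runmqserver during startup.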

andreysaksonov commented 2 years ago

Yes, it solves the issue, thanks. I will leave a link to the original GSKit bug: https://www.ibm.com/support/pages/apar/IJ28497

andyedwardsibm commented 5 months ago

We've started hitting this in App Connect in containers too.

When considering the comment above from parrobe (which basically says "you must set this to fix the problem, but must not set it if you don't see the problem"), we've identified an interesting problem when running in a cluster. Initially, one might think that it's safe to set ICC_SHIFT on a cluster, and so we could add a config parameter for an operand so that users can selectively set ICC_SHIFT.

But the cluster may have a mix of hardware for the workers. Some may use AMD chips susceptible to this problem, some may use Intel chips. This means that whether the user should set the config parameter depends on which worker a Pod lands on, and Pod placement is, by default, random.
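
To make the mixed-hardware point concrete, here is a hedged sketch of how a script running inside a Pod could check which kind of worker it landed on. It is not something proposed in this thread or recommended by IBM. It assumes a Linux node that exposes vendor_id in /proc/cpuinfo, and it treats the vendor string as a coarse proxy only, since the thread says just some AMD processors are affected. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: report the CPU vendor of the node this process is running on.
# cpu_vendor [FILE]: print the first vendor_id value from a cpuinfo-style file.
cpu_vendor() {
    awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}" 2>/dev/null
}

# icc_shift_for_vendor VENDOR: print 3 for AMD, nothing otherwise.
# A vendor match is only a hint; not every AMD chip is reported as affected.
icc_shift_for_vendor() {
    case "$1" in
        AuthenticAMD) echo 3 ;;
        *) ;;
    esac
}

vendor="$(cpu_vendor)"
shift_value="$(icc_shift_for_vendor "$vendor")"
if [ -n "$shift_value" ]; then
    echo "node CPU vendor is $vendor: candidate for ICC_SHIFT=$shift_value"
else
    echo "node CPU vendor is ${vendor:-unknown}: leave ICC_SHIFT unset"
fi
```

The check has to run on the node where the Pod actually starts, which is why a single cluster-wide config parameter cannot capture it.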

imavo commented 2 months ago

I understand that there has been an update (fixpack) to the IBM GSKit component that addresses this CPU-specific issue, so you could contact IBM support to get the updated GSKit and apply it to your estate, or instead wait for the next fixpack update to your MQ component(s) to include the updated GSKit. Hopefully that would then mean a permanent fix without needing workarounds like ICC_SHIFT=3.

parrobe commented 2 months ago

Hi @imavo - IBM MQ aims to update all our third-party component versions to the latest with each released version of IBM MQ. Unfortunately, there are delays in taking the latest GSKit version due to issues found through our regression testing. We will update as soon as we are able.