ibmruntimes / Semeru-Runtimes

Issue repo for all things IBM Semeru Runtimes

InstantOn checkpoint failed on ZLinux with EA Java21 Liberty 24.0.0.1 UBI9-min #67

Closed tam512 closed 9 months ago

tam512 commented 11 months ago

I performed a checkpoint on a ZLinux VM with the following command:

sudo podman build \
  --build-arg BASE_IMAGE=$INSTANTON_BASE_IMAGE \
  --build-arg HYCSVT=$HYCSVT \
  --build-arg CHECKPOINT=$checkpoint \
  -t $APPLICATION/$checkpoint:"${tag}" \
  --cap-add=CHECKPOINT_RESTORE \
  --cap-add=SYS_PTRACE \
  --cap-add=SETPCAP \
  --security-opt seccomp=unconfined \
  -f BuildFiles/$APPLICATION/Containerfile \
  --no-cache \
  --volume /libertyrepo:/opt/libertyrepo:z \
  ./BuildFiles/$APPLICATION

The build failed with the following output:


[AUDIT   ] CWWKC0451I: A server checkpoint "afterAppStart" was requested. When the checkpoint completes, the server stops.

Unhandled exception

Type=Segmentation error vmState=0x00000000

J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001

Handler1=000003FF96C49958 Handler2=000003FF96B319D8 InaccessibleAddress=0000000000000000

gpr0=0000000000053B63 gpr1=0000000000000000 gpr2=0000000000000000 gpr3=0000000000000000

gpr4=000003FF00000000 gpr5=000003FF94A7E0B0 gpr6=000000000000006E gpr7=0000000000000002

gpr8=0000000000000000 gpr9=000003FF4802ACB0 gpr10=000003FF903EEE48 gpr11=000003FF4802AC88

gpr12=000003FF785E4370 gpr13=000003FF94A7F840 gpr14=000003FF9751DAE8 gpr15=000003FF94A7DE58

psw=000003FF9751A1AE mask=0705200180000000 fpc=00000000 bea=000003FF9751DAE2

fpr0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
...............
..........................
Module=/lib64/libc.so.6

Module_base_address=000003FF97480000 Symbol=_pthread_cleanup_pop

Symbol_address=000003FF9751A1A0

Target=2_90_20231017_35 (Linux 5.14.0-284.11.1.el9_2.s390x)

CPU=s390x (2 logical CPUs) (0xcb8df000 RAM)

----------- Stack Backtrace -----------

_pthread_cleanup_pop+0xe (0x000003FF9751A1AE [libc.so.6+0x9a1ae])

pthread_cond_timedwait+0x298 (0x000003FF9751DAE8 [libc.so.6+0x9dae8])

monitor_wait_original+0x654 (0x000003FF96A882EC [libj9thr29.so+0x82ec])

omrthread_monitor_wait_interruptable+0x6a (0x000003FF96A8BC12 [libj9thr29.so+0xbc12])

timeCompensationHelper+0x28c (0x000003FF96C98BC4 [libj9vm29.so+0x98bc4])

monitorWaitImpl+0x162 (0x000003FF96C98E5A [libj9vm29.so+0x98e5a])

Fast_java_lang_Object_wait+0x1a (0x000003FF96C4773A [libj9vm29.so+0x4773a])

 (0x000003FF785E40AE [<unknown>+0x0])

---------------------------------------

JVMDUMP039I Processing dump event "gpf", detail "" at 2023/12/12 19:55:00 - please wait.

JVMDUMP032I JVM requested System dump using '/opt/ol/wlp/output/defaultServer/core.20231212.195500.1033.0001.dmp' in response to an event

Unhandled exception

Type=Segmentation error vmState=0x00000000

J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001

Handler1=000003FF96C49958 Handler2=000003FF96B319D8 InaccessibleAddress=0000000000000000
..............

I have sent the full log and information to Shubham Verma and Rahil Shah.

tam512 commented 10 months ago

The root cause of this issue is that SELinux on this VM is in enforcing mode. The SELinux limitations are documented here: https://openliberty.io/docs/latest/instanton-limitations.html#se

In this case, the checkpoint crashed, which prevented Liberty from displaying the following message. That message was added as a fix for https://github.com/OpenLiberty/open-liberty/issues/24522

CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command. 

We need more investigation into why the crash occurred. It likely originates in CRIU.

tam512 commented 9 months ago

Testing with the Open Liberty 24.0.0.2/wlp-1.0.86.cl240220240211-1900 container image, I verified that the crash in this defect no longer occurs when the SELinux VM has "virt_sandbox_use_netlink --> off". Instead, I see the following expected error message:

CWWKE0963E: The server checkpoint request failed because netlink system calls were unsuccessful. If SELinux is enabled in enforcing mode, netlink system calls might be blocked by the SELinux "virt_sandbox_use_netlink" policy setting. Either disable SELinux or enable the netlink system calls with the "setsebool virt_sandbox_use_netlink 1" command.
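
For anyone hitting the same failure, here is a rough sketch of a pre-check to run on the build host before attempting the checkpoint. It relies only on the standard SELinux tools `getenforce` and `getsebool`, plus the `setsebool` command quoted in the message above. The helper name `netlink_blocked` is made up for illustration, and the check assumes `getsebool` prints its usual `name --> on|off` line. It is not part of Liberty or Semeru.

```shell
#!/bin/sh
# Hypothetical helper (not from Liberty): returns 0 when SELinux is likely
# to block the netlink system calls needed for an InstantOn checkpoint.
#   $1: SELinux mode as printed by `getenforce` (Enforcing, Permissive, Disabled)
#   $2: line printed by `getsebool virt_sandbox_use_netlink`
netlink_blocked() {
    mode="$1"
    bool_line="$2"
    # Only enforcing mode actually denies the calls.
    [ "$mode" = "Enforcing" ] || return 1
    case "$bool_line" in
        *"--> off") return 0 ;;
        *) return 1 ;;
    esac
}

# Run the check only when the SELinux tools are installed on this host.
if command -v getenforce >/dev/null 2>&1 && command -v getsebool >/dev/null 2>&1; then
    mode=$(getenforce)
    bool_line=$(getsebool virt_sandbox_use_netlink 2>/dev/null)
    if netlink_blocked "$mode" "$bool_line"; then
        echo "SELinux is enforcing and virt_sandbox_use_netlink is off."
        echo "Run: sudo setsebool virt_sandbox_use_netlink 1"
    else
        echo "The SELinux netlink policy should not block the checkpoint."
    fi
else
    echo "SELinux tools not found; skipping check."
fi
```

Note that `setsebool` without `-P` changes the boolean only until the next reboot, so a persistent setup would need `-P` as well.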