spring-projects / spring-framework

CRaC restore fails with ClassNotFoundException on Jar path #33226

Status: Closed (shmyer closed this issue 1 month ago)

shmyer commented 1 month ago

Affects: 6.1.10
JDK: zulu21.34.19-ca-crac-jdk21.0.3-linux_x64
Running on a Linux VM


I am trying to use CRaC with a Spring Boot app. I have run into several issues so far, including logback appenders causing jdk.crac.impl.CheckpointOpenFileException on checkpoint creation (https://github.com/spring-projects/spring-boot/issues/38548) and the Eureka Discovery Client leaving a connection open because it fetches the registry before the checkpoint. I was able to work around those issues and got checkpointing to work.

Now I am stuck on the restore. As you can see in the attached log, the restore code is trying to load my Spring Boot Jar as a class and of course it can't find that. I don't quite understand why it does that.

I've also attached the CRIU dump and restore logs below. They seem fine to me, but I might be wrong.


Spring Boot Log:

24476: Error (criu/tty.c:843): tty: Can't set tty params on 0x26, trying to skip...: Inappropriate ioctl for device
2024-07-17T12:18:51.537Z  INFO 24476 --- [app] [           main] o.s.c.support.DefaultLifecycleProcessor  : Restarting Spring-managed lifecycle beans after JVM restore
2024-07-17T12:18:51.694Z  INFO 24476 --- [app] [           main] o.s.c.support.DefaultLifecycleProcessor  : Spring-managed lifecycle restart completed (restored JVM running for 693 ms)
2024-07-17T12:18:51.762Z  WARN 24476 --- [app] [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Failed to restore CRaC checkpoint on refresh
2024-07-17T12:18:51.855Z  INFO 24476 --- [app] [           main] com.netflix.discovery.DiscoveryClient    : Shutting down DiscoveryClient ...
2024-07-17T12:18:54.864Z  INFO 24476 --- [app] [           main] com.netflix.discovery.DiscoveryClient    : Unregistering ...
2024-07-17T12:18:55.297Z  INFO 24476 --- [app] [           main] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_app/<eureka-host>:app:8702 - deregister  status: 404
2024-07-17T12:18:55.301Z  INFO 24476 --- [app] [           main] com.netflix.discovery.DiscoveryClient    : Completed shut down of DiscoveryClient
2024-07-17T12:18:55.322Z  INFO 24476 --- [app] [           main] o.apache.catalina.core.StandardService   : Stopping service [Tomcat]
2024-07-17T12:18:55.373Z  INFO 24476 --- [app] [           main] .s.b.a.l.ConditionEvaluationReportLogger :

Error starting ApplicationContext. To display the condition evaluation report re-run your application with 'debug' enabled.
2024-07-17T12:18:55.441Z ERROR 24476 --- [app] [           main] o.s.boot.SpringApplication               : Application run failed

org.springframework.context.ApplicationContextException: Failed to restore CRaC checkpoint on refresh
        at org.springframework.context.support.DefaultLifecycleProcessor$CracDelegate.checkpointRestore(DefaultLifecycleProcessor.java:539) ~[spring-context-6.1.10.jar!/:6.1.10]
        at org.springframework.context.support.DefaultLifecycleProcessor.onRefresh(DefaultLifecycleProcessor.java:194) ~[spring-context-6.1.10.jar!/:6.1.10]
        at org.springframework.context.support.AbstractApplicationContext.finishRefresh(AbstractApplicationContext.java:981) ~[spring-context-6.1.10.jar!/:6.1.10]
        at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:627) ~[spring-context-6.1.10.jar!/:6.1.10]
        at org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext.refresh(ServletWebServerApplicationContext.java:146) ~[spring-boot-3.3.1.jar!/:3.3.1]
        at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:754) ~[spring-boot-3.3.1.jar!/:3.3.1]
        at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:456) ~[spring-boot-3.3.1.jar!/:3.3.1]
        at org.springframework.boot.SpringApplication.run(SpringApplication.java:335) ~[spring-boot-3.3.1.jar!/:3.3.1]
        at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:149) ~[spring-boot-3.3.1.jar!/:3.3.1]
        at de.app.platform.aggregate.CustomApplicationBuilder.run(CustomApplicationBuilder.java:36) ~[app-platform-aggregate-18.0.0-b002eaed.jar!/:18.0.0-b002eaed]
        at de.app.MyApplication.main(MyApplication.java:10) ~[!/:18.0.0-b002eaed]
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
        at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
        at org.springframework.boot.loader.launch.Launcher.launch(Launcher.java:91) ~[app-18.0.0-b002eaed.jar:18.0.0-b002eaed]
        at org.springframework.boot.loader.launch.Launcher.launch(Launcher.java:53) ~[app-18.0.0-b002eaed.jar:18.0.0-b002eaed]
        at org.springframework.boot.loader.launch.JarLauncher.main(JarLauncher.java:58) ~[app-18.0.0-b002eaed.jar:18.0.0-b002eaed]
Caused by: org.crac.RestoreException: null
        at org.crac.Core$Compat.checkpointRestore(Core.java:150) ~[crac-1.4.0.jar!/:na]
        at org.crac.Core.checkpointRestore(Core.java:237) ~[crac-1.4.0.jar!/:na]
        at org.springframework.context.support.DefaultLifecycleProcessor$CracDelegate.checkpointRestore(DefaultLifecycleProcessor.java:530) ~[spring-context-6.1.10.jar!/:6.1.10]
        ... 15 common frames omitted
        Suppressed: java.security.PrivilegedActionException: null
                at java.base/java.security.AccessController.doPrivileged(AccessController.java:575) ~[na:na]
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:230) ~[na:na]
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:294) ~[na:na]
                at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:273) ~[na:na]
                at jdk.crac/jdk.crac.Core.checkpointRestore(Core.java:72) ~[jdk.crac:na]
                at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[na:na]
                at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[na:na]
                at org.crac.Core$Compat.checkpointRestore(Core.java:141) ~[crac-1.4.0.jar!/:na]
                ... 17 common frames omitted
        Caused by: java.lang.ClassNotFoundException: /<path-to-jar>/app-18.0.0-b002eaed.jar
                at java.base/java.lang.Class.forName0(Native Method)
                at java.base/java.lang.Class.forName(Class.java:534)
                at java.base/java.lang.Class.forName(Class.java:513)
                at java.base/jdk.internal.crac.mirror.Core$2.run(Core.java:233)
                at java.base/jdk.internal.crac.mirror.Core$2.run(Core.java:230)
                at java.base/java.security.AccessController.doPrivileged(AccessController.java:571)
                ... 24 common frames omitted

CRIU Logs: dump4.log restore.log
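
Editor's note: the innermost cause in the Spring Boot log above is a ClassNotFoundException whose message is the JAR path, raised from Class.forName. The sketch below only illustrates that symptom in isolation: Class.forName resolves class names, so handing it a file path fails in the same way. The class name JarPathAsClassName, the helper tryLoad, and the path /opt/app/app.jar are hypothetical placeholders. The sketch does not explain why the restore code ends up passing the JAR path, which this thread does not establish.

```java
final class JarPathAsClassName {

    // Tries to resolve the given string as a class name via Class.forName
    // and reports the outcome as text instead of throwing.
    static String tryLoad(String name) {
        try {
            return "loaded " + Class.forName(name).getName();
        } catch (ClassNotFoundException e) {
            return "ClassNotFoundException: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Hypothetical file path standing in for the elided path in the log.
        System.out.println(tryLoad("/opt/app/app.jar"));
        // A real class name, for contrast.
        System.out.println(tryLoad("java.lang.String"));
    }
}
```

Running it on a plain JDK should show the first call failing and the second succeeding, which matches the shape of the "Caused by" entry in the trace.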

sdeleuze commented 1 month ago

If you are using containers, be aware that configuring capabilities may be required; see https://github.com/sdeleuze/spring-boot-crac-demo/blob/main/restore.sh for an example. Also, make sure the path to app-18.0.0-b002eaed.jar does not change between checkpoint and restore (which can happen with volumes, etc.).

Is app-18.0.0-b002eaed.jar the executable JAR of your Spring Boot app?

shmyer commented 1 month ago

I am not in a container environment. I am on a Linux VM on a VMware host. Could capabilities still be an issue here? I am currently on a 4.12 Linux kernel, which does not have the CHECKPOINT_RESTORE capability yet. It seems that on older Linux kernels, SYS_ADMIN is the capability required for checkpoint/restore. I am using a non-root user.

However, as far as I understand, CRIU nevertheless runs as root, because I had to ask our sysadmins to apply the following, as described at https://docs.azul.com/core/crac/crac-debugging#failures-in-native-checkpoint-or-restore:

sudo chown root:root /path/to/criu
sudo chmod u+s /path/to/criu

Without that it didn't get past the CRIU part of the restore. But according to my restore.log the CRIU part of the restore seems to be working now.

Yes, the file's location is the same during the creation of the checkpoint and during the restore. Yes, this Jar file is the executable JAR of my Spring Boot app.
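
Editor's note: on the kernel-version point in this comment, CAP_CHECKPOINT_RESTORE was added in Linux 5.9, so a 4.12 kernel predates it. A minimal sketch of that comparison follows. The class and method names are hypothetical, and it assumes a Linux JVM, where the os.version system property holds the kernel release string. It checks the version number only, not which capabilities a given process actually holds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CheckpointRestoreCapability {

    // Matches the leading "major.minor" of a kernel release such as "4.12.14-default".
    private static final Pattern MAJOR_MINOR = Pattern.compile("^(\\d+)\\.(\\d+)");

    // Returns true if the release is 5.9 or newer, the first kernel
    // series that defines CAP_CHECKPOINT_RESTORE.
    static boolean kernelSupportsCapCheckpointRestore(String release) {
        Matcher m = MAJOR_MINOR.matcher(release);
        if (!m.find()) {
            throw new IllegalArgumentException("Unrecognized kernel release: " + release);
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > 5 || (major == 5 && minor >= 9);
    }

    public static void main(String[] args) {
        // Only meaningful on Linux; other platforms report their own OS version here.
        String release = System.getProperty("os.version");
        System.out.println(release + " -> " + kernelSupportsCapCheckpointRestore(release));
    }
}
```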

sdeleuze commented 1 month ago

This looks more like a JDK/CRaC-level issue, so I'm not sure what we can do about it on the Framework side. Do you agree?

spring-projects-issues commented 1 month ago

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

shmyer commented 1 month ago

I guess you're right. In the end I've decided to abandon my plans to use CRaC. It doesn't seem mature enough to me.