rh-jmc-team / jigawatts

Build a jar file for easier access to CRIU from Java

Questions about hooks #10

Open · tjwatson opened this issue 3 years ago

tjwatson commented 3 years ago

While experimenting with CRIU on a relatively complex application (in my case Open Liberty https://github.com/OpenLiberty/open-liberty), I have developed doubts about the hook mechanism in CheckpointRestore.

My main doubt concerns the requirement to serialize the hooks on the dump operation and deserialize them on the restore operation. If a dump saves the complete process and all of its state in the image, why do we also need to save the state of the hooks separately with serialization? Wouldn't the hook objects already be available once we restore, so that we can call the restore side straight away when the process resumes?

For Open Liberty I will likely need to introduce my own hook SPI/API [1] anyway, to give subsystems within Liberty the flexibility to participate in the prepare/restore operations, but my current thinking is not to serialize the hooks at all. The general flow is:

  1. Gather up all the hook implementations into a snapshot
  2. Invoke prepare on all the hooks
  3. Perform the CRIU dump (I have currently modified JavaCriuJar to terminate the process on dump).
  4. Use the criu command to restore. I would like to avoid invoking the JVM to restore, to avoid the overhead of starting up a Java process.
  5. The process picks up from where it was frozen. The thread that invoked the dump continues right away and still holds a reference to the snapshot of hooks used during the prepare phase; that thread then immediately calls restore on the same snapshot of hooks it called prepare on.
  6. Each hook has its own state, which is persisted with the dumped image. On restore, from the hook's point of view, it is the same object and can restore anything it was responsible for in prepare.
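The flow above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Open Liberty's actual SPI: `SnapshotHook` is only loosely named after the SPI in [1] and its methods here are assumptions, `HookSnapshot` and `CheckpointFlow` are hypothetical, and the CRIU dump is replaced by a stub `Runnable` so the example runs without CRIU. The point it shows is that the thread calling `prepare` keeps references to the same hook objects, so after a restore it can call `restore` on them directly, with no serialization involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical hook interface; loosely named after the SnapshotHook SPI in [1],
// but these method names are assumptions for illustration.
interface SnapshotHook {
    void prepare();
    void restore();
}

// Step 1: an immutable snapshot of the hooks registered at checkpoint time.
final class HookSnapshot {
    private final List<SnapshotHook> hooks;

    HookSnapshot(List<SnapshotHook> registered) {
        this.hooks = Collections.unmodifiableList(new ArrayList<>(registered));
    }

    void prepareAll() {
        for (SnapshotHook hook : hooks) {
            hook.prepare();
        }
    }

    void restoreAll() {
        for (SnapshotHook hook : hooks) {
            hook.restore();
        }
    }
}

final class CheckpointFlow {
    // 'dump' is a stand-in for the CRIU dump call (hypothetical). With a dump
    // that terminates the process, the code after dump.run() would only
    // execute in the process later restored by the criu command.
    static void checkpoint(List<SnapshotHook> registered, Runnable dump) {
        HookSnapshot snapshot = new HookSnapshot(registered); // step 1
        snapshot.prepareAll();                                 // step 2
        dump.run();                                            // steps 3-4
        snapshot.restoreAll();                                 // steps 5-6: same objects, same state
    }
}
```

Because the snapshot lives on the dumping thread's stack, it is captured in the image like any other object, which is the property the flow relies on.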

My reason for opening this issue is to discuss how the hooks work in JavaCriuJar, to make sure I am not overlooking something in the strategy described above for Open Liberty.

If my understanding is correct, we could push a similar strategy down into the JavaCriuJar library. The difficulty for me is that I need to separate the CheckpointRestore/Hook API from the implementation, so that the implementation can be installed on top while lower levels of Open Liberty can participate and implement the prepare/restore hooks without requiring the actual CheckpointRestore implementation to be available at runtime. One solution is to split JavaCriuJar into two JARs: 1) one for the API and 2) one for the implementation. On the other hand, I am also fine with Open Liberty having its own hook API and only using JavaCriuJar to invoke the criu library calls that perform the dump.
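One way to get that separation is for lower layers to compile against a small API module only, while the implementation is discovered at runtime with the standard `java.util.ServiceLoader`. This is a sketch under assumptions: `CheckpointHook`, `HookRegistry`, `Checkpointer`, and `CheckpointSupport` are hypothetical names, and whether JavaCriuJar should be split this way is exactly the open question here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;

// ---- API JAR (hypothetical). In a real split these types would be public
// ---- and live in a named package; they are package-private here so the
// ---- sketch fits in one file.
interface CheckpointHook {
    void prepare();
    void restore();
}

// Registering a hook needs only the API JAR, not the CRIU implementation.
final class HookRegistry {
    private static final List<CheckpointHook> HOOKS = new ArrayList<>();

    static synchronized void register(CheckpointHook hook) {
        HOOKS.add(hook);
    }

    static synchronized List<CheckpointHook> snapshot() {
        return Collections.unmodifiableList(new ArrayList<>(HOOKS));
    }
}

// Service interface whose implementation would ship in a separate JAR
// (for example one that calls into CRIU). Hypothetical.
interface Checkpointer {
    void checkpoint(List<CheckpointHook> hooks);
}

final class CheckpointSupport {
    // Returns the installed implementation, if any. With no implementation
    // JAR on the class path this is Optional.empty().
    static Optional<Checkpointer> find() {
        return ServiceLoader.load(Checkpointer.class).findFirst();
    }
}
```

An implementation JAR would provide a public class implementing `Checkpointer` and list it in a `META-INF/services/` file named after the interface's fully qualified name, which is the standard `ServiceLoader` registration mechanism (`findFirst` requires Java 9 or later).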

[1] https://github.com/tjwatson/open-liberty/blob/criu/dev/com.ibm.ws.kernel.boot.core/src/io/openliberty/checkpoint/spi/SnapshotHook.java

chflood commented 3 years ago

Do you foresee needing to impose an order on the hooks?

I'm happy to discuss a better option than serialization.

tjwatson commented 3 years ago

I would like to clear up some confusion on my part about the management of:

org.checkpoint.CheckpointRestore.restoreHooks

  1. In CheckpointRestore.saveTheWorld(String), each element of restoreHooks is serialized to the file JavaRestoreHooks.txt.
  2. Then CheckpointRestore.saveTheWorldNative(String) is called. Will the saved image already contain a fully populated restoreHooks list as part of the process state?
  3. At this point, say the Java application continues and eventually exits.
  4. Later, a new process is started and invokes CheckpointRestore.restoreTheWorld(String). This should restore the state, which I think includes all the objects of the Java process saved in step 2. At that point, wouldn't we already have a restoreHooks list fully populated with the hooks that were saved in the CRIU image?
  5. Then the code deserializes the hooks serialized in step 1. Will this add duplicates to restoreHooks: one set saved as part of the CRIU image and another saved through object serialization by the JavaCriuJar code?
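The duplicate concern in step 5 can be reproduced with plain Java serialization, independent of CRIU. In this sketch the in-memory `restoreHooks` list stands for the list that is already populated when the image is restored, and a byte array stands for `JavaRestoreHooks.txt`. `DemoHook` and `SerializationDemo` are hypothetical, and the code models the concern rather than JavaCriuJar's actual implementation. Deserialization always creates new objects, so the list ends up holding two distinct copies of each hook.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical serializable hook; no equals() override, so identity equality applies.
class DemoHook implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;

    DemoHook(String name) {
        this.name = name;
    }
}

final class SerializationDemo {
    // Step 1: serialize each hook (stands in for writing JavaRestoreHooks.txt).
    static byte[] save(List<DemoHook> hooks) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeInt(hooks.size());
            for (DemoHook hook : hooks) {
                out.writeObject(hook);
            }
        }
        return bytes.toByteArray();
    }

    // Step 5: deserialize and append to a list that, after a CRIU restore,
    // would already contain the original hook objects.
    static void loadInto(List<DemoHook> restoreHooks, byte[] saved)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                restoreHooks.add((DemoHook) in.readObject());
            }
        }
    }

    // Simulates both sources of hooks ending up in the same list.
    static List<DemoHook> simulateRestore() {
        List<DemoHook> restoreHooks = new ArrayList<>();
        restoreHooks.add(new DemoHook("closeSockets"));
        try {
            byte[] saved = save(restoreHooks);
            loadInto(restoreHooks, saved);
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return restoreHooks;
    }
}
```

After `simulateRestore()` the list has two entries with the same name but different identities, so each restore action would run twice unless the code deduplicates or skips one of the two sources.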

I will admit my grasp of how the restore side works, when the running Java instance (executing CheckpointRestore.restoreTheWorld(String)) is transformed into the restored process, is pretty sketchy. My initial reaction is that it would be far more reliable to invoke the criu restore command directly, rely on execution picking right back up at the point where the process was saved, and run restore right there against the existing hooks that were present at the time of the save.

As for order, I was planning for it to be controlled on the Open Liberty side, through the registration of the hooks. As of now I don't have good examples where hook order matters, but I won't be surprised if one comes up. This does raise a good point, though: if the same hook object does both prepare and restore, it may make sense to run restore in the reverse of the prepare order.
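Running restore in the reverse of the prepare order gives a last-in, first-out pairing, much like try-with-resources closes resources in the reverse order of their declaration. A minimal sketch, assuming a hypothetical `Hook` interface and `OrderedHooks` helper: each hook is pushed onto an `ArrayDeque` as it is prepared and popped for restore, with a stub `Runnable` in place of the CRIU dump.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical hook interface for this sketch.
interface Hook {
    void prepare();
    void restore();
}

final class OrderedHooks {
    // Prepares hooks in registration order and restores them in reverse,
    // so the first hook prepared is the last one restored.
    static void checkpoint(List<Hook> hooks, Runnable dump) {
        Deque<Hook> prepared = new ArrayDeque<>();
        for (Hook hook : hooks) {
            hook.prepare();
            prepared.push(hook); // the most recently prepared hook is on top
        }
        dump.run(); // stand-in for the CRIU dump
        while (!prepared.isEmpty()) {
            prepared.pop().restore();
        }
    }
}
```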

vijaysun-omr commented 3 years ago

I think one scenario in which hook order may matter is if we distinguish between "application" hooks and "JVM" hooks. For example, we might want the JVM hook to compact the Java heap and release unneeded memory to the OS as one of the last actions before taking a snapshot, to keep the snapshot small.
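The application/JVM distinction could be expressed by tagging each hook with a category and sorting before prepare, so that JVM-level work such as heap compaction runs last, immediately before the snapshot. This sketch assumes a hypothetical `Phase` enum and `PhasedHook` class; `Collections.sort` is guaranteed to be stable, so registration order is preserved within each phase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical categories; the enum declaration order defines prepare order.
enum Phase { APPLICATION, JVM }

final class PhasedHook {
    final Phase phase;
    final String name;

    PhasedHook(Phase phase, String name) {
        this.phase = phase;
        this.name = name;
    }
}

final class PhaseOrdering {
    // Returns hooks in prepare order: all APPLICATION hooks, then all JVM
    // hooks. The stable sort keeps registration order within each phase.
    static List<PhasedHook> prepareOrder(List<PhasedHook> registered) {
        List<PhasedHook> ordered = new ArrayList<>(registered);
        Collections.sort(ordered, Comparator.comparing((PhasedHook h) -> h.phase));
        return ordered;
    }
}
```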

chflood commented 3 years ago

Currently, restore hooks only run if you use the Java restore interface.

I need to think a bit more about how to do this when restore happens via the command line. It's not obvious: you restore back to where you were in the checkpoint code, and there isn't a good place to add a call that executes the restore hooks only in the CRIU-restored JVM and not in the original JVM. I may be making this too hard.

(Sent as an email reply to the notification for Vijay Sundaresan's comment of Mon, Apr 26, 2021, 3:44 PM: https://github.com/chflood/JavaCriuJar/issues/10#issuecomment-827096624)
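One possible shape for the command-line restore problem: the thread that requested the dump branches on a result that says whether execution resumed in a restored process or is continuing in the original JVM. This is a sketch under a loud assumption: `SelfDumper` and its `dumpAndReportRestored` method are hypothetical, and whether the native layer can supply such a flag is not established in this thread. `RestoreHook` and `CheckpointEntry` are likewise hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical native wrapper: returns true when execution resumes in a
// process restored from the image, false when the original JVM continues.
interface SelfDumper {
    boolean dumpAndReportRestored();
}

// Hypothetical hook interface for this sketch.
interface RestoreHook {
    void prepare();
    void restore();
}

final class CheckpointEntry {
    // Runs restore hooks only in the restored process and returns the flag
    // so callers can branch on it as well.
    static boolean checkpoint(List<RestoreHook> hooks, SelfDumper dumper) {
        List<RestoreHook> snapshot = new ArrayList<>(hooks);
        for (RestoreHook hook : snapshot) {
            hook.prepare();
        }
        boolean restored = dumper.dumpAndReportRestored();
        if (restored) {
            for (RestoreHook hook : snapshot) {
                hook.restore();
            }
        }
        // In the original JVM (restored == false) the restore hooks are
        // skipped; whether it should undo its prepare work instead is a
        // separate design decision.
        return restored;
    }
}
```

If the dump terminates the original process, as in the Open Liberty flow described earlier, the flag would always be true wherever this code runs, and the branch matters only when the original JVM is left running.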