s-u / rJava

R to Java interface
https://RForge.net/rJava

Unloading rJava and/or restarting JVM #25

Open jeroen opened 10 years ago

jeroen commented 10 years ago

[Repost from this SO question].

I would like to use rJava in combination with mcparallel, but obviously the JVM cannot be forked. Therefore, a separate JVM instance needs to be initiated for each child process, e.g.:

library(rJava)
library(parallel)
myfile <-  system.file("tests", "test_import.xlsx", package = "xlsx")

#This works:
mccollect(mcparallel({
  #Automatically initiates JVM in child
  xlsx::read.xlsx(myfile, 1)
}))

However the problem in my case is that the JVM has already been initiated in the (main) parent process as well. This makes it impossible to use rJava in the child process:

#init JVM in parent
.jinit()

#Doesn't work anymore
mccollect(mcparallel({
  xlsx::read.xlsx(myfile, 1)
}))

So what I really need is a way to shutdown/kill and restart the JVM in the child process. Simply detach("package:rJava", unload = TRUE) doesn't seem to do the trick. The force.init parameter doesn't seem to result in a restart either:

#Also doesn't work:
.jinit()
mccollect(mcparallel({
  .jinit(force.init = TRUE)
  xlsx::read.xlsx(myfile, 1)
}))

Is there some way I can forcefully shutdown/kill the JVM in order to reinitiate it in the child process?

asieira commented 10 years ago

:+1:

Apart from the mclapply scenario described by @jeroenooms, there is another concern: the JVM grabs heap memory and never again returns it to the OS.

So, if you have a task that will perform better in Java but requires a large amount of memory, I would really like to be able to:

Thus, being able to load/unload the JVM several times over time in a single R program could be really useful. Even better if the heap size (as per -Xmx) can be easily set when the JVM is loaded.
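
As far as I can tell, the heap-size part is at least partially possible today, as long as it is decided before the JVM comes up; a minimal sketch (the 4g value is just an example):

# heap size must be chosen before the JVM is initialized; it cannot be changed afterwards
options(java.parameters = "-Xmx4g")
library(rJava)
.jinit()

# sanity check: maximum heap the JVM will use, in MB
rt <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
.jcall(rt, "J", "maxMemory") / 1024^2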

gasse commented 10 years ago

I have exactly the same need as you, @asieira: I run a machine learning algorithm from the R extraTrees package (http://cran.r-project.org/web/packages/extraTrees) that consumes a lot of memory. I need to run the algorithm several times on benchmark data, and after a while I repeatedly get an OutOfMemory error.

My understanding of the problem is that there must be some references to Java objects which are kept indefinitely in my R session between each call to extraTrees(), and thus cannot be cleaned by the JVM's garbage collector. After some repeated calls, the JVM heap grows too much and ends with an OutOfMemory exception.

Is there a correct way to overcome the problem? I thought about restarting the JVM each time but it doesn't seem to work:

library("rJava")
library("extraTrees")

for (...) {
  .jinit(force.init = TRUE, parameters="-Xmx4g")
  .jpackage(name = "extraTrees")

  ...
  my.model = extraTrees(train.x, train.y)
  ...
}
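
One thing I am going to try (no idea yet whether it actually helps) is to drop every R-side reference to the fitted model between runs and nudge both garbage collectors, roughly:

my.model <- extraTrees(train.x, train.y)
pred <- predict(my.model, test.x)

rm(my.model)                            # drop the R reference holding the Java object
gc()                                    # R finalizers release the rJava references
.jcall("java/lang/System", "V", "gc")   # then ask the JVM to collect as well
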
asieira commented 10 years ago

@gasse yours seems like a completely different issue.

The point @jeroenooms and myself are making with this issue is that we would very much like to be able to stop the entire JVM. For example, the implementation of a .junload function that would return the R process to having no JVM associated with it.
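
Something like this is what I have in mind (purely hypothetical - no such function exists in rJava today):

options(java.parameters = "-Xmx8g")   # big heap for the Java-heavy task
library(rJava)
.jinit()
# ... run the memory-hungry Java work ...

.junload()                            # hypothetical: drop all references, destroy the JVM

# later in the same R process, start a fresh JVM with a smaller heap
.jinit(parameters = "-Xmx1g")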

The first thing I would check in your case is whether you have a 32-bit or 64-bit JRE installed, by running java -version on the CLI. Since you're allocating 4 gigs to the JVM, you should be using a 64-bit JRE or it will be limited to 2 gigs. Hope this helps.
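
If the JVM is already up inside R, something along these lines should also tell you (the sun.arch.data.model property is HotSpot-specific, so treat it as a hint):

library(rJava)
.jinit()
.jcall("java/lang/System", "S", "getProperty", "os.arch")              # e.g. "amd64" or "x86"
.jcall("java/lang/System", "S", "getProperty", "sun.arch.data.model")  # "64" or "32" on HotSpot JVMs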

gasse commented 10 years ago

Hi, the problem may be quite different, but a solution to your problem (clean up Java memory once and for all) is also a solution to my problem (clean up Java memory after each call to the extraTrees() function).

I am running a 64-bit JRE, and I am also interested in being able to change the heap size of the JVM without having to restart my entire R session, by restarting the JVM for example.

asieira commented 10 years ago

I see what you mean now, @gasse. So yes, being able to unload and reload the JVM at different times with different heap sizes would allow us to accommodate any memory leaks and/or differences in data size between runs.

Hope the nice people who wrote rJava get to work on this issue soon.

s-u commented 10 years ago

There are two problems here and those are the main reason why this is not supported and why it wouldn't work in most cases even if rJava allowed it. First, rJava has no control over any existing Java references, so it's essentially impossible to shut down the JVM. Even if rJava tried, those references would lead to a crash. In addition, destroying the VM is a voluntary operation, e.g., if any threads are running, they will prevent the VM from shutting down.

So, it's impossible to solve the practical issue above - if the JVM is already started before you get control, then whichever code started it has already created references that exist and cannot be removed (without that code providing a way to shut down and clear the references).

What rJava could do is provide ways to remove its own references and attempt a shutdown. However, for the above reasons that is pretty much guaranteed to fail, because it cannot release its class loader as long as there is even a single live object around - which is pretty much guaranteed (unless someone just called .jinit() but didn't create any object, which is unlikely).

jeroen commented 10 years ago

Thank you for your answer, Simon. Do I understand correctly that even if we were to detach and unload all R packages that depend on/import rJava, there is still no way to signal the JVM to exit? How does this work when the R process itself exits/dies? Does the JVM get terminated in that case, or is it up to the OS to clean up?

asieira commented 10 years ago

Hi, @s-u. Thank you for replying.

Looking at the Oracle documentation on unloading the VM here, I can only see an issue being raised with other threads left running. So yes, the R programmer would need to be diligent in making sure that no other threads are left running.

I am not clear though on what you mean by references. Are you concerned that R variables containing underlying C pointers to Java data that no longer exists will be accessed and cause a segmentation fault or similar?

s-u commented 10 years ago

@jeroenooms yes, the only reliable way to terminate the JVM is via System.exit() which kills the entire process (including R itself). As for unload, in principle that would get us closer, but we would also have to guarantee that there are no objects from these packages still assigned in any environment. I fear that this is not practical, but if you think it is realistic, then, yes, I could add an attempt in rJava to relinquish all its references and try a JVM termination. Whether that actually works is not guaranteed, but it's the best rJava can do.

@asieira yes, exactly. The only way for those to get removed is if there is no reference to them anywhere in R. rJava also keeps some references cached internally, but those I can remove on rJava unload.

asieira commented 10 years ago

@s-u aren't all Java references encapsulated in R objects obtained through rJava? So, isn't it possible in theory to:

Is any of that even remotely feasible? Sorry if I'm suggesting vague and impossible things here - I'm not really sure how the R and C integration works.

s-u commented 10 years ago

No, because rJava doesn't own those references. Except for a few internal objects, rJava releases all references to objects created through its API, so they get released by R if they are not in use. If they are in use, then even if rJava kept track of them, they could not be released, since someone is still using them.

Technically, it would be possible to keep a list of all allocated references and force-deallocate them, but it would be very expensive and would break all packages that use rJava if someone destroys the JVM. As a user I would certainly not be happy about such behavior.

asieira commented 10 years ago

I agree, that would be awful for other packages.

What if the packages could provide two optional functions to .jinit that would be called respectively prior to unloading (to free up references) and when the JVM is reloaded (to return to a sane state), so they could handle this scenario themselves?

In this case, it would still be backwards compatible if the JVM was never unloaded. And if it was, it would fall to the user to ensure they only load rJava-dependent packages that can handle this situation. Maybe you could even fail the unloading if at least one loaded package didn't register these optional functions with .jinit, to make things even safer.
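
To sketch what I mean (the argument names are invented; nothing like this exists in rJava today):

# hypothetical API: a package registers unload/reload hooks when it initializes Java
.jinit(before.unload = function() {
  # the package drops/cleans its cached Java references here
}, after.reload = function() {
  # the package re-creates its class loaders / singletons here
})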

ManuelB commented 8 years ago

Hi, just another bug report. I use RJDBC to connect to an SQL Server. When I request a big table (1,400,000 rows), I get:

Error in .jcall(rp, "I", "fetch", stride, block) : 
  java.lang.OutOfMemoryError: Java heap space

It would be nice if it would be possible to set the max heap size:

options(java.parameters = "-Xmx2g")

And afterwards restart the JVM.

As a workaround I can just restart the whole R session in RStudio.

/Manuel

s-u commented 8 years ago

Yes, you should really set java.parameters in your startup scripts such that it is always active. That said, I think you don't actually need it for RJDBC - you only need to run gc() between the fetch() calls so Java can release the data you already got on the R side.
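
Roughly like this (the driver class, jar path and connection details are obviously just placeholders):

# in your startup script (e.g. .Rprofile), before rJava/RJDBC is loaded:
options(java.parameters = "-Xmx2g")

library(RJDBC)
drv  <- JDBC("com.microsoft.sqlserver.jdbc.SQLServerDriver", "/path/to/sqljdbc.jar")
conn <- dbConnect(drv, "jdbc:sqlserver://host;databaseName=db", "user", "password")

res <- dbSendQuery(conn, "SELECT * FROM big_table")
repeat {
  chunk <- fetch(res, n = 10000)   # fetch in blocks rather than all rows at once
  if (nrow(chunk) == 0) break
  # ... process chunk ...
  gc()                             # lets Java release the rows already copied to R
}
dbClearResult(res)
dbDisconnect(conn)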

russellpierce commented 8 years ago

If I read the conversation up until now correctly, we are limited to one JVM per application and we can't guarantee or control shutting down a JVM without generating overhead.

Follow-up naive question: is it conceptually possible to isolate the JVM in a separate process and then communicate between that process and R via a socket? A back-of-napkin script with DBI/RJDBC seemed to confirm that I could hold a separate db connection on a single SNOW node on localhost at the same time as the db connection on the master node. Of course, in that case the communication with the JVM happens entirely within the slave node rather than needing to access/interact with R objects on the master - so maybe there is no way to accomplish anything like this without a massive rewrite?

Of course, the socket would have to invalidate on a fork and spawn its own JVM - but that part seems trivial relative to the potential costs/issues of isolating the JVM on a separate process/in a separate application. There would be overhead, of course, so this probably would have to be implemented as an option rather than as a default mode of operation.
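
The napkin test was roughly along these lines (details simplified; the H2 driver, jar path and query are just stand-ins for the real database):

library(parallel)

cl <- makePSOCKcluster(1)      # separate R process on localhost, so it gets its own JVM

clusterEvalQ(cl, {
  options(java.parameters = "-Xmx2g")
  library(RJDBC)
  drv  <- JDBC("org.h2.Driver", "/path/to/h2.jar")
  conn <- dbConnect(drv, "jdbc:h2:mem:test")
  TRUE
})

# all JVM work stays inside the worker; only plain R data frames cross the socket
res <- clusterEvalQ(cl, dbGetQuery(conn, "SELECT 1 AS x"))[[1]]

stopCluster(cl)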

s-u commented 8 years ago

Sure, it's relatively easy to have separate processes. However, the point of rJava is exactly to NOT have a separate process, otherwise you could have used Rserve, SNOW or anything like that which has other benefits - but speed and memory efficiency are not among them. So you should pick the tool that is best for the requirements you have.

x1o commented 5 years ago

To everyone reading this who has problems with the JVM not releasing heap memory back to the OS: @asieira 's claim that

the JVM grabs heap memory and never again returns it to the OS.

is true only for the default GC in some early versions of Java (prior to 1.8, IIRC). It is indeed possible to choose a GC that will release the heap on performing a full garbage collection, see https://www.javacodegeeks.com/2017/11/minimize-java-memory-usage-right-garbage-collector.html. Furthermore, some (modern) GCs do not even require a full garbage collection to keep the JVM's memory footprint down to the amount it actually uses, see https://openjdk.java.net/jeps/346 and https://wiki.openjdk.java.net/display/shenandoah/Main.
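
For example, something along these lines (exact flags depend on your JDK version, and they must be set before the JVM is initialized, i.e. in a fresh R session):

# must run before library(rJava) / .jinit()
options(java.parameters = c(
  "-Xmx4g",
  "-XX:+UseG1GC",               # G1 can give freed heap back to the OS after a full GC
  "-XX:MinHeapFreeRatio=10",
  "-XX:MaxHeapFreeRatio=30"
))
library(rJava)
.jinit()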

s-u commented 1 month ago

While trying to address #334 I now have final proof that it is impossible to truly unload or even re-initialize a VM. I have implemented a fully clean approach (just a pure C program) which dynamically loads a JVM with RTLD_LOCAL, initializes it, calls java.lang.System.getProperty("java.version"), frees all references and then destroys the JVM with DestroyJavaVM. It turns out the JVM is never released, even if the process which loaded it unloads libjvm. Surprisingly, the same is true if one extra level of indirection is inserted (a separate dynamic library loads the JVM instead of the main process - the library itself is unloaded, but libjvm is not). In addition, any attempt to re-create the JVM after using DestroyJavaVM fails with a JNI_CreateJavaVM error. (All tests were performed with LTS versions of OpenJDK Temurin.) Hence I have to conclude that this is simply impossible with current JVM implementations.

chrisknoll commented 1 month ago

Thanks for the update, @s-u. Not to hijack this thread, but I do have a question related to JVM instantiation:

Will it ever be the case that JVM instances are shared across child R session processes, such that calling out to the JVM through rJava would amount to basically a new thread invocation on a VM shared across different R sessions? Or is it the case that, as long as child R sessions materialize as separate processes, each will get its own JVM?

And taking this a little further: if R ever introduces a way to do parallel computation that doesn't involve new R processes (i.e., more single-process/multi-threaded), would that mean that child 'sessions' of R would share a process and therefore the JVM?

I'm asking so that we can avoid pitfalls in our programming decisions around these technical considerations.