s-u / REngine

General Java interface to R supporting multiple back-ends such as JRI and Rserve

Large array conversion #30

Open FeatureMan2 opened 1 year ago

FeatureMan2 commented 1 year ago

Hi,

I'm facing the issue that when converting Java data to the R representation, my datasets are too large and it fails.

Specifically, it fails here https://github.com/s-u/REngine/blame/master/Rserve/protocol/REXPFactory.java#L476 because cont.asDoubles().length is about 700 million, which multiplied by 8 bytes is roughly 5.6 billion, larger than the maximum int value of about 2.1 billion. Shouldn't we be using long instead of int throughout this package to support larger datasets?
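
For illustration, the arithmetic in Java terms (a throwaway snippet; the 700 million figure is the one quoted above):

// Why the conversion fails: the payload size in bytes exceeds Integer.MAX_VALUE,
// so it can no longer be held in the int fields used when assembling the packet.
public class OverflowDemo {
    public static void main(String[] args) {
        long elements = 700_000_000L;                    // ~700 million doubles
        long bytes = elements * 8L;                      // 5,600,000,000 bytes
        System.out.println(bytes > Integer.MAX_VALUE);   // true: 2^31-1 = 2,147,483,647
    }
}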

Thanks for any thoughts on this.

s-u commented 1 year ago

@FeatureMan2 Unfortunately, this is a Java limitation, because in Java arrays can have at most 2^31-1 elements and are indexed by signed integers. Java does not support long as array indices.

You could split the dataset and then c or rbind it back together on the R side. Note that even if we changed the whole way packets are assembled in REngine, say by using arrays of arrays instead, it would only increase the capacity by at most a factor of 8, because your source arrays on the Java side are subject to the same limitation.
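
A minimal sketch of that idea, assuming an open RConnection; the chunk size, the target variable name and the temporary symbol .chunk are illustrative, and error handling is omitted:

import java.util.Arrays;

import org.rosuda.REngine.REXPDouble;
import org.rosuda.REngine.Rserve.RConnection;
import org.rosuda.REngine.Rserve.RserveException;

public class ChunkedPush {
    /** Pushes a large double[] to R in pieces and reassembles it there with c(). */
    static void pushInChunks(RConnection conn, String target, double[] data, int chunkSize)
            throws RserveException {
        conn.voidEval(target + " <- numeric(0)");
        for (int off = 0; off < data.length; off += chunkSize) {
            double[] part = Arrays.copyOfRange(data, off, Math.min(off + chunkSize, data.length));
            conn.assign(".chunk", new REXPDouble(part));              // transfer one piece
            conn.voidEval(target + " <- c(" + target + ", .chunk)");  // append on the R side
        }
        conn.voidEval("rm(.chunk)");
    }
}

With a chunk size of a few million elements, each individual packet stays well below the 2^31-1 byte limit.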

Put simply, Java is not a language designed for data analysis, especially not large data, so you may have more luck reading the data directly from R instead if you can.

FeatureMan2 commented 1 year ago

Thanks for your insights. I think I'll use your fallback method with c/rbind.

FeatureMan2 commented 1 year ago

I tried the c approach with some initial success, but ultimately got stuck again.

I split the large array into chunks then reassembled it in R with c. But after that the engine would fail on random statements - even simple string assignment operations.

So then I switched to using the Renjin engine to export the Java dataset to its embedded R engine, storing it to an .RData file, then loading it back into the REngine environment. The idea was to avoid sending so much data over the socket. It worked, but I ran into the same random errors again...

I'm not quite clear on what is producing these errors. I initially thought it was RAM usage, but I can run

a <- rep(pi, 2^31-1)
b <- rep(pi, 2^31-1)
c <- c(a,b)

using up >64GB of Rserve memory...

s-u commented 1 year ago

Can you post your actual code? We can't really help you based on the anecdotes above.

FeatureMan2 commented 1 year ago

The code is long and unfortunately I'm not able to share it.

It essentially fails at this point https://github.com/s-u/REngine/blob/ba09e3bd0fc5f2f99296451f96c7eb2422621226/Rserve/RConnection.java#L286 (this is with the old REngine 0.9.2 under R 3.6, but I get the same errors with 2.1.1 under R 4.2), where ro.isOK() returns false. I'm simply assigning a string to a variable at that point. I don't think it matters what statement I execute; it falls over regardless. I cannot manually execute any statement in debug mode after such an error.

This error comes after loading an R numeric array of 650 million elements with load(file="myfile1.RData"), executing a few more trivial assignment statements, and then trying to load another, similar file of 650 million elements. What's weird is that it doesn't fail on loading the RData file itself, but just before that, on assigning the string load(file="myfile2.RData") to a variable before running R's eval() on it (exactly as myfile1.RData was loaded previously). The Linux Rserve process is using about 15 GB of RAM at this point. The error codes I get are typically 17 or sometimes 98.

Note that this same code works just fine on smaller datasets, say if I have 5 million elements...

I was getting similar types of errors when splitting the arrays into smaller chunks and concatenating them with c().

s-u commented 1 year ago

It looks as if something goes wrong with the previous statement. It could also be that R simply runs out of memory. First, run the debug version of Rserve (either set debug=TRUE in Rserve() or run R CMD Rserve.dbg instead of R CMD Rserve) - that way you can see what is happening. Second, can you create a reproducible example you can share? You could, for example, mimic the dataset with something like data.frame(a=1:1e9/2), creating something of similar size and type without revealing your data.

FeatureMan2 commented 1 year ago

Our problem was solved by splitting the datasets Java -> R using the suggested c(a,b) trick, and also splitting the return values R -> Java (as these were large too). This was combined with judicious use of RConnection.voidEval() instead of the default REngine.parseAndEval(); otherwise, R statements with large return values were automatically transmitted back to Java, leading to misleading errors caused by what I assume was a compromised state of the REngine after executing such statements.
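
Roughly, the retrieval side looked like this (a simplified sketch, not the actual production code; the names, slice size and temporary symbol .res are illustrative): the heavy expression is evaluated with voidEval() so its result stays in R, and the numeric result is then pulled back in slices.

import org.rosuda.REngine.REXPMismatchException;
import org.rosuda.REngine.Rserve.RConnection;
import org.rosuda.REngine.Rserve.RserveException;

public class ChunkedPull {
    /** Evaluates a heavy expression with voidEval(), then fetches the numeric result in slices. */
    static double[] evalAndPull(RConnection conn, String expression, int sliceSize)
            throws RserveException, REXPMismatchException {
        // voidEval() keeps the (possibly huge) result on the R side instead of
        // shipping it back to Java as the return value of the evaluation.
        conn.voidEval(".res <- {" + expression + "}");
        int n = conn.eval("length(.res)").asInteger();
        double[] out = new double[n];
        for (int off = 0; off < n; off += sliceSize) {
            int end = Math.min(off + sliceSize, n);
            double[] part = conn.eval(".res[" + (off + 1) + ":" + end + "]").asDoubles();
            System.arraycopy(part, 0, out, off, part.length);
        }
        conn.voidEval("rm(.res)");
        return out;
    }
}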

I would suggest exposing a voidEval() method on REngine, since it is not available there by default. Hope this helps someone. It may also be worth updating the Javadoc or main documentation to state that large arrays (roughly 200M+ elements) will not work, and to suggest work-arounds.

s-u commented 1 year ago

Thanks. Since this is a Java limitation, different applications may use different solutions depending on the use-case.

RConnection.voidEval(x) is the same as REngine.eval(x, null, false). Besides, you always have control over what you return from the evaluation - most commonly you return some indicator of success. I have the feeling that you would probably be better off evaluating compound statements, since that is more efficient than issuing multiple voidEvals.
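
For example, a single compound call that does the heavy work and returns only a small status value could look like this (a sketch; the file name and the loaded variable x are placeholders):

import org.rosuda.REngine.REXP;
import org.rosuda.REngine.REXPMismatchException;
import org.rosuda.REngine.Rserve.RConnection;
import org.rosuda.REngine.Rserve.RserveException;

public class CompoundEval {
    public static void main(String[] args) throws RserveException, REXPMismatchException {
        RConnection conn = new RConnection();
        // One round trip: load the data, transform it, return only a small scalar.
        REXP status = conn.eval("{ load(file='myfile1.RData'); x <- x * 2; length(x) }");
        System.out.println("elements processed: " + status.asInteger());
        conn.close();
    }
}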