jpmml / jpmml-evaluator-r

PMML evaluator library for R
GNU Affero General Public License v3.0

The predictions take very long #1

Open sadsquirrel369 opened 2 days ago

sadsquirrel369 commented 2 days ago

Hi there,

I've tested the PMML evaluator in Python, which was more than 100x faster than the one in R. The R evaluator appears to use only a single core on my machine. Is there anything I can do to speed it up?

sadsquirrel369 commented 2 days ago

(image attachment, not preserved)

vruusmann commented 1 day ago

> I've tested the pmml evaluator in Python which was more than 100x faster than the one in R

This is a known issue.

The project is currently in pre-release status. I just wanted to build a proof of concept, to see whether all the bits and pieces needed for an R wrapper are available.

The prototype is successful in the sense that predictions can be made. The next step is to make it performant.

vruusmann commented 1 day ago

I believe that most of the time is spent in the formatArguments and parseResults methods, which convert an R data container (a named list) to a Java data container (a java.util.HashMap), and back.

I can think of two workarounds:

  1. Pass the named list from R to Java directly as a list object in the RDS data format, and then read the results back as another list object. The trouble is that the Java side can currently read/parse RDS, but it cannot format/write RDS.
  2. Stop using the named list data container altogether, and exchange data in plain CSV format.

I personally find the first workaround a bit more elegant, but it needs some research and development around the RDS data format. Then again, the RDS formatter only needs to support named lists (plus R scalar types), so hopefully it's not a lot of work.
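For comparison, the second workaround could look roughly like the sketch below on the Java side. This is only an illustration under stated assumptions: the class CsvExchangeSketch and its methods readCsv, writeCsv and predictCsv are hypothetical names, not part of the wrapper or of any JPMML library; the model is stood in for by a plain function; and the CSV handling is deliberately naive (header row first, no quoting or escaping).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: exchange a whole batch as CSV text instead of converting named lists.
class CsvExchangeSketch {

    // Parse naive CSV text (header row first, no quoting or escaping) into one map per record.
    static List<Map<String, String>> readCsv(String csv) {
        List<Map<String, String>> records = new ArrayList<>();
        if (csv.isEmpty()) {
            return records;
        }
        String[] lines = csv.split("\\R");
        String[] names = lines[0].split(",", -1);
        for (int row = 1; row < lines.length; row++) {
            if (lines[row].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[row].split(",", -1);
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < names.length; i++) {
                record.put(names[i], i < cells.length ? cells[i] : "");
            }
            records.add(record);
        }
        return records;
    }

    // Format result records as CSV text, taking the column names from the first record.
    static String writeCsv(List<Map<String, String>> records) {
        if (records.isEmpty()) {
            return "";
        }
        List<String> names = new ArrayList<>(records.get(0).keySet());
        StringBuilder sb = new StringBuilder();
        sb.append(String.join(",", names)).append("\n");
        for (Map<String, String> record : records) {
            List<String> cells = new ArrayList<>();
            for (String name : names) {
                cells.add(record.getOrDefault(name, ""));
            }
            sb.append(String.join(",", cells)).append("\n");
        }
        return sb.toString();
    }

    // Read all argument records, apply the stand-in model to each, and write all results.
    static String predictCsv(String csv, Function<Map<String, String>, Map<String, String>> model) {
        List<Map<String, String>> results = new ArrayList<>();
        for (Map<String, String> arguments : readCsv(csv)) {
            results.add(model.apply(arguments));
        }
        return writeCsv(results);
    }

    public static void main(String[] args) {
        // Stand-in "model" (an assumption for the demo): adds the two input columns.
        Function<Map<String, String>, Map<String, String>> model = record ->
            Collections.singletonMap("sum",
                String.valueOf(Integer.parseInt(record.get("x")) + Integer.parseInt(record.get("y"))));
        System.out.print(predictCsv("x,y\n1,2\n3,4\n", model));
    }
}
```

The trade-off is that CSV drops type information (every cell arrives as a string), so the R side would have to restore column types after reading the results, whereas an RDS-based exchange would keep them.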

@sadsquirrel369 If you have any other ideas how to implement the data exchange between R and Java more efficiently, please share.

vruusmann commented 1 day ago

> I've tested the pmml evaluator in Python which was more than 100x faster than the one in R

Another thing is that the Python wrapper supports batch prediction mode (passing 1,000 data records back and forth at once), whereas the R wrapper doesn't. Instead, it iterates over the batch rows on the R side, and performs a single/elementary prediction operation on each of them.

So, the R wrapper should be able to exchange "a list of named lists" with the Java side atomically.
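A minimal sketch of what such an atomic exchange might look like on the Java side follows. Everything here is hypothetical: BatchEvaluationSketch, evaluateRecord and evaluateBatch are illustrative names rather than existing wrapper or JPMML methods, the "model" simply doubles a field named x, and the boundaryCrossings counter only simulates the cost of a call from R into Java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: row-wise versus batch entry points for a Java-side evaluator.
class BatchEvaluationSketch {

    // Simulated count of calls that cross the R-to-Java boundary.
    static int boundaryCrossings = 0;

    // Stand-in model (an assumption): returns y = 2 * x for a single record.
    private static Map<String, Object> predict(Map<String, Object> arguments) {
        Map<String, Object> results = new HashMap<>();
        results.put("y", 2 * ((Number) arguments.get("x")).doubleValue());
        return results;
    }

    // Row-wise entry point: the caller crosses the boundary once per record.
    static Map<String, Object> evaluateRecord(Map<String, Object> arguments) {
        boundaryCrossings++;
        return predict(arguments);
    }

    // Batch entry point: the caller crosses the boundary once, and the loop runs in Java.
    static List<Map<String, Object>> evaluateBatch(List<Map<String, Object>> batch) {
        boundaryCrossings++;
        List<Map<String, Object>> results = new ArrayList<>(batch.size());
        for (Map<String, Object> arguments : batch) {
            results.add(predict(arguments));
        }
        return results;
    }

    public static void main(String[] args) {
        // Build a batch of 1,000 records, i.e. "a list of named lists".
        List<Map<String, Object>> batch = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            Map<String, Object> record = new HashMap<>();
            record.put("x", i);
            batch.add(record);
        }

        // Row-wise mode: one simulated crossing per record.
        boundaryCrossings = 0;
        for (Map<String, Object> record : batch) {
            evaluateRecord(record);
        }
        System.out.println("row-wise crossings: " + boundaryCrossings);

        // Batch mode: a single simulated crossing for the whole list.
        boundaryCrossings = 0;
        evaluateBatch(batch);
        System.out.println("batch crossings: " + boundaryCrossings);
    }
}
```

In this sketch the row-wise loop crosses the simulated boundary 1,000 times and the batch call crosses it once. The real saving would depend on how expensive each R-to-Java call and the per-record conversion actually are.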