The present method of invoking R from Jaql uses the rjson package to parse the arguments.
I ran a quick experiment to measure the overhead: I do a group by and invoke the R function "sum" on each group of elements. There are about 480 R invocations on 8 reducers, and each group has on average 1000 elements (an array of 1000 numbers).
The program has been running for the last 6 hours and has made 10% progress in the reducer beyond the shuffle and sort phase (which accounts for 66% of the work). The corresponding Jaql job takes 5 mins and 3 secs to complete.
I am proposing the following changes:
* Add functionality to the rJaql package to do the interfacing.
* On the Jaql side, introduce two different functions, RValue and RBinary. Both will behave the same in terms of execution, but with RValue the result from R is converted to the appropriate JSON type, whereas with RBinary it is kept as a BINARY object and passed to R.
* Again, the same principles will be used for data transfer as in the rJaql package.
Let me know your comments.
Original comment by sudipt...@gmail.com
on 18 Sep 2009 at 5:55
Yes, rjson is not exactly a high-performance package. The proposed changes sound good; the only minor change is that we might do everything with one function that takes an appropriate argument. Something along the lines of
R(..., binary=true)
or
R(..., mode="binary")
for more flexibility (when we add enums to the schema, the second option is clearly preferable). In both cases, binary is the default.
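As a rough illustration of why a mode argument may extend better than a boolean flag, here is a minimal Python sketch (not Jaql). The names `ResultMode` and `parse_mode`, and the `"value"` mode name, are hypothetical and chosen only for this example; the default follows the "binary is the default" remark above.

```python
from enum import Enum


class ResultMode(Enum):
    """Hypothetical result modes; more can be added without changing the call signature."""
    BINARY = "binary"
    VALUE = "value"


def parse_mode(mode="binary"):
    """Map the string passed as mode=... to a ResultMode, rejecting unknown names."""
    try:
        return ResultMode(mode)
    except ValueError:
        allowed = [m.value for m in ResultMode]
        raise ValueError(f"unknown mode {mode!r}; expected one of {allowed}")
```

With a boolean `binary=true/false`, a third kind of result would need a second flag; with a mode argument it is only one more enum member.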
What is your current draft of the API to pass value to R?
Original comment by Rainer.G...@gmx.de
on 18 Sep 2009 at 6:52
The present draft of the API is similar to the RFn now in the code base. I am adding the new functions, which can eventually replace RFn. I can make changes to RFn as well if everyone wants that. The first argument to RFn provides an initializing script to R. Right now it is a big string of R code. I want it to be the location of an R file, which will be loaded into the distributed cache before the job starts, and that script will be sourced in the slave R instances. The second argument is the name of the function to execute. This can be either a function that was sourced or an R built-in. The remaining arguments are the arguments to the function.
In the present design, all arguments to the function are converted from Jaql to R types using rjson. I am proposing replacing it with rJaql; the transformation rules and data transfer rules will be similar to those in rJaql.
Original comment by sudipt...@gmail.com
on 18 Sep 2009 at 5:23
How can we distinguish arguments given as binary, as a JSON value, or as a table? Or is this something that is dealt with in user code on the R side?
Parallel to this issue, what are your thoughts about an API like this:
R(init, function, args, initIsFile=false, binary="binary")
so that init can also be a string. We might then add other arguments as needed, e.g., arguments that control reuse of R processes (yes/no) and clean-up functions (in between calls of the same R process; to shut down an R process).
Original comment by Rainer.G...@gmx.de
on 18 Sep 2009 at 5:40
Binary is meant to be dealt with by user-specified R code, because Jaql would not want to deal with it.
If there is an initializing script, why would you like an initializing string?
And the R function can have a variable number of arguments, so it is better to put the necessary arguments, such as binary, before the arguments to the RFn start.
Original comment by sudipt...@gmail.com
on 18 Sep 2009 at 5:47
After a discussion with Rainer, this is what we are proposing:
I will modify the RFn. It will now have the following arguments.
R(init, fnname, arguments, schema=null,init_inline=false,binary=false)
The argument listing follows the new call-by-name feature introduced in the branch.
init: The initialization script. It can be either an inline R script or a path to a file. The interpretation is governed by the optional argument init_inline. By default the function expects init to be a path to a script (this is subject to change depending on the general use case).
fnname: The name of the R function to execute. It can be a function in the init
script, or a built-in.
arguments: An array of arguments to be passed to the R function. We want it to be an array so that we know it is one object, and so that we can do away with the requirement for a variable number of arguments, since variable-length argument lists do not tie in with the optional named arguments. As described earlier, conversion from Jaql to R will be governed by the rJaql transformation rules.
schema: An array of schemata, parallel to the array of arguments, that overrides the inferred schema of the arguments. This is needed because the conversion rules are guided by schema, and if schema inference is not accurate, conversion will fail. This argument is optional.
init_inline: The default value is false, which means that the passed init argument is assumed to be a path to the R init script.
binary: This governs how the result of the R function will be interpreted. If true, the result is treated as BINARY and Jaql doesn't need to peek into it. If false (default), Jaql converts the returned R object into the appropriate JSON type.
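The argument rules above can be sketched in Python as a small model of the proposed call. This is not Jaql or the actual RFn implementation: the class `RCall`, its helper methods, and the file name `stats.R` are hypothetical, and schemata are represented as opaque values purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RCall:
    """Hypothetical model of R(init, fnname, arguments, schema, init_inline, binary)."""
    init: str
    fnname: str
    arguments: List[Any]
    schema: Optional[List[Any]] = None  # optional; parallel to `arguments`
    init_inline: bool = False           # False: `init` is a path to a script
    binary: bool = False                # False: the result is converted

    def __post_init__(self):
        # A supplied schema array must have one entry per argument.
        if self.schema is not None and len(self.schema) != len(self.arguments):
            raise ValueError("schema must be parallel to arguments")

    def init_kind(self) -> str:
        return "inline script" if self.init_inline else "script path"

    def result_kind(self) -> str:
        return "BINARY" if self.binary else "converted"


# Example: call the built-in "sum" on one array argument.
call = RCall("stats.R", "sum", [[1, 2, 3]])
```

Passing the arguments as a single array is what lets the optional named parameters follow them without ambiguity.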
Kevin and Vuk, let me know your comments. I might have some more changes as I start implementation.
Original comment by sudipt...@gmail.com
on 18 Sep 2009 at 9:00
Checked in the first working version of RFn with most of the features enabled. Refer to r335. There are subtle changes to the invocation; it now looks like
R(fn, args=[item arg1, ..., item argN],inSchema=[schema arg1, ...,schema argN],
outSchema=null, init=null, initInline=true, binary=false, flexible=false)
At this point it uses rjson to convert the R object back to JSON when binary=false. The present version also does not support initInline=false.
Issues: Schema inference does not work even when outSchema is passed.
Original comment by sudipt...@gmail.com
on 23 Sep 2009 at 9:21
Original issue reported on code.google.com by
sudipt...@gmail.com
on 17 Sep 2009 at 1:17