nexr / RHive

RHive is an R extension facilitating distributed computing via Apache Hive.
http://nexr.github.io/RHive

File not found exception while running queries on UDAF #98

Open Lxmnkmr opened 8 years ago

Lxmnkmr commented 8 years ago

I created a new UDAF with the following commands:

hsum <- function(prev, sal) { if(is.null(prev)) sal else prev + sal }

hsum.partial <- function(agg_sal) { agg_sal }

hsum.merge <- function(prev, agg_sal) { if(is.null(prev)) agg_sal else prev + agg_sal }

hsum.terminate <- function(agg_sal) { agg_sal }

rhive.assign('hsum', hsum)
rhive.assign('hsum.partial', hsum.partial)
rhive.assign('hsum.merge', hsum.merge)
rhive.assign('hsum.terminate', hsum.terminate)
rhive.exportAll('hsum')
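To illustrate what these four functions correspond to, here is a minimal sketch (in Python, mirroring the R functions above) of how Hive drives a UDAF: map-side rows are folded in by the iterate function (`hsum`) and serialized by `hsum.partial`, then the reduce side combines partial buffers with `hsum.merge` and emits the result via `hsum.terminate`. The data values and the two-mapper split are made up for illustration; this is not RHive's actual execution path.

```python
def hsum(prev, sal):            # iterate: fold one row's value into the buffer
    return sal if prev is None else prev + sal

def hsum_partial(agg_sal):      # partial: hand the map-side buffer to the reducer
    return agg_sal

def hsum_merge(prev, agg_sal):  # merge: combine partial buffers on the reduce side
    return agg_sal if prev is None else prev + agg_sal

def hsum_terminate(agg_sal):    # terminate: produce the final aggregate value
    return agg_sal

# Two hypothetical mappers each aggregate their split of 'sal' values...
splits = [[6000, 8000], [5000]]
partials = []
for rows in splits:
    buf = None
    for sal in rows:
        buf = hsum(buf, sal)
    partials.append(hsum_partial(buf))

# ...then the reducer merges the partials and terminates.
final = None
for p in partials:
    final = hsum_merge(final, p)
print(hsum_terminate(final))  # -> 19000
```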

Everything worked fine up to that point, and the file "hsum.RData" was created in the "/rhive/udf/user_name" directory. But when I use the UDAF in a query on one of my tables:

rhive.query("select RA('hsum', prev, sal) from employee")

It fails with the error: "Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask"

and when I checked the log, I found the exception below:

org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"emp_id":667266,"prev":6000,"sal":8000}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:562)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: File /rhive/udf/user_name/hsum.RData does not exist
    at com.nexr.rhive.hive.udf.RUDAF$GenericRUDAF.loadRObjects(RUDAF.java:525)
    at com.nexr.rhive.hive.udf.RUDAF$GenericRUDAF.iterate(RUDAF.java:244)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:185)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processHashAggr(GroupByOperator.java:787)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:693)
    at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:117)
    at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:167)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:552)
    ... 9 more
Caused by: java.io.FileNotFoundException: File /rhive/udf/user_name/hsum.RData does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:609)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:822)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:599)
    at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:88)
    at com.nexr.rhive.hive.udf.RUDAF$GenericRUDAF.loadRObjects(RUDAF.java:517)
    ... 21 more
attempt_1466028339333_0021_m_000000_1

As I mentioned above, "hsum.RData" had already been created in the "/rhive/udf/user_name" directory when I ran the export command, yet I still get the exception: "java.io.FileNotFoundException: File /rhive/udf/user_name/hsum.RData does not exist at com.nexr.rhive.hive.udf.RUDAF$GenericRUDAF.loadRObjects"

Lxmnkmr commented 8 years ago

As per my understanding, the .RData file is created in and accessed from the /rhive/udf/user_name directory in the Hadoop file system, yet the error says the file was not found in that directory. Just to give it a try, I created the /rhive/udf/user_name directory on the local file system of all my datanodes, copied the .RData file to them manually, and then the query ran without any error!
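For anyone wanting to reproduce the workaround, the copy step can be sketched as follows. This is a single-machine simulation in Python: temporary directories stand in for the HDFS export location and for /rhive/udf/user_name on a datanode's local disk, and the file content is a placeholder. On a real cluster the equivalent would be fetching the file from HDFS and pushing it to that local path on every datanode.

```python
import shutil
import tempfile
from pathlib import Path

# Stand-in for the HDFS directory where rhive.exportAll wrote the object.
hdfs_stage = Path(tempfile.mkdtemp())
(hdfs_stage / "hsum.RData").write_bytes(b"placeholder")  # fake exported object

# Stand-in for /rhive/udf/user_name on one datanode's local file system.
datanode_local = Path(tempfile.mkdtemp()) / "rhive" / "udf" / "user_name"

# Recreate the directory locally and copy the file into it -- the manual
# step that made the query succeed in the report above.
datanode_local.mkdir(parents=True)
shutil.copy(hdfs_stage / "hsum.RData", datanode_local / "hsum.RData")
print((datanode_local / "hsum.RData").exists())  # -> True
```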

Any inputs on why it is looking for the .RData file on the local file system of the datanodes instead of the Hadoop file system? Also, this happens only when I use UDAFs; with UDFs I don't see any error, and the .RData file is created in and accessed from the Hadoop file system only.