What steps will reproduce the problem?
What is the expected output? What do you see instead?
What version of the product are you using? On what operating system?
jaql trunk r248
Please provide any additional information below.
There seems to be a weird problem with KeyLookup that has popped up recently: it appears to be unable to read one of the temp files that it itself generates. This is a new problem; I did not run into it until last night. Here is how it goes.
Following is the JAQL code:
$ratings = read(hdfs('/user/sudipto/netflix/data/all/json'));
$estrate = 0;
$cust = read(hdfs('/user/sudipto/netflix/data/all/materialized/custparam'));
$movie = read(hdfs('/user/sudipto/netflix/data/all/materialized/movieparam'));
$imHashJoin = fn($outer, $okey, $inner, $ikey) (
  $build = $inner -> transform [$ikey($), $],
  $outer -> transform [$okey($), $]
         -> keyLookup($build)
         -> transform {$[1].*, $[2].*}
);
$ratings
  -> $imHashJoin(fn($r) $r.tid, $movie, fn($m) $m.movie_id)
  -> $imHashJoin(fn($r) $r.cid, $cust, fn($c) $c.cust_id)
  -> transform { $.cust_id, $.movie_id, $.rating, diff: $.rating - $estrate,
                 $.cparam, $.mparam }
  -> write(hdfs('/user/sudipto/netflix/data/all/materialized/join'));
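For reference, here is a minimal sketch of what the $imHashJoin pattern above is expected to compute, using tiny inline toy data instead of the HDFS inputs. The records and values are hypothetical, and it assumes keyLookup emits [key, outerValue, innerValue] triples, which is what the $[1]/$[2] accesses above rely on:
// Toy inner table keyed by movie_id (hypothetical data).
$build = [ [7, {movie_id: 7, mparam: 0.25}] ];
// One toy outer record keyed by tid; same pipeline shape as in $imHashJoin.
[ {tid: 7, cid: 1, rating: 4} ]
  -> transform [$.tid, $]
  -> keyLookup($build)
  -> transform {$[1].*, $[2].*};
// Expected result: [{ tid: 7, cid: 1, rating: 4, movie_id: 7, mparam: 0.25 }]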
In the hash join, Jaql first spawns one MR job per inner table to read it and Temp it; it then joins the large outer table against the temped inner tables. (This is somewhat new; I think the earlier version of keyLookup did not do this. Perhaps you wanted to fix the inlining problem of this expression?) Nevertheless, when the 3rd MR job, the one that does the main join, is spawned, it reports the following error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://impl00.almaden.ibm.com:9000/user/sudipto/jaql_temp_4847551314303483
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:210)
    at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter.getSplits(DefaultHadoopInputAdapter.java:163)
    at com.ibm.jaql.io.hadoop.DefaultHadoopInputAdapter.iter(DefaultHadoopInputAdapter.java:184)
    at com.ibm.jaql.lang.expr.io.AbstractReadExpr$1.<init>(AbstractReadExpr.java:100)
    at com.ibm.jaql.lang.expr.io.AbstractReadExpr.iter(AbstractReadExpr.java:99)
    at com.ibm.jaql.lang.expr.index.KeyLookupFn.iter(KeyLookupFn.java:72)
    at com.ibm.jaql.lang.expr.core.BindingExpr.iter(BindingExpr.java:209)
    at com.ibm.jaql.lang.expr.core.TransformExpr.iter(TransformExpr.java:148)
    at com.ibm.jaql.lang.expr.core.DoExpr.iter(DoExpr.java:126)
    at com.ibm.jaql.lang.core.JaqlFunction.iter(JaqlFunction.java:269)
    at com.ibm.jaql.lang.core.JaqlFunction.iter(JaqlFunction.java:350)
    at com.ibm.jaql.lang.expr.hadoop.MapReduceBaseExpr$MapEval.run(MapReduceBaseExpr.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Following is the explain:
(
$fd_2 = mapReduce({
("input"):(hdfs("/user/sudipto/netflix/data/all/materialized/custparam")),
("output"):(HadoopTemp()), ("map"):(fn($mapIn) ( $mapIn
-> transform each $ ([null, [($).("cust_id"), $]]) )
) }),
$fd_0 = mapReduce({
("input"):(hdfs("/user/sudipto/netflix/data/all/materialized/movieparam")),
("output"):(HadoopTemp()), ("map"):(fn($mapIn) ( $mapIn
-> transform each $ ([null, [($).("movie_id"), $]]) )
) }),
write((
$fd_1 = mapReduce({
("input"):(hdfs("/user/sudipto/netflix/data/all/json")),
("output"):(HadoopTemp()), ("map"):(fn($mapIn) ( keyLookup($mapIn
-> transform each $ ([($).("tid"), $]), read($fd_0))
-> transform each $ ([null, { (index($, 1)).*, (index($, 2)).* }]) )
) }),
keyLookup(read($fd_1)
-> transform each $ ([($).("cid"), $]), read($fd_2))
-> transform each $ ({ (index($, 1)).*, (index($, 2)).* })
)
-> transform each $ ({ (($)).("cust_id"), (($)).("movie_id"),
(($)).("rating"), ("diff"):((($).("rating"))-(0)), (($)).("cparam"),
(($)).("mparam") }), hdfs("/user/sudipto/netflix/data/all/materialized/join"))
)
Note that this problem was encountered when the MapReduce cluster was running under a username different from the user account used to submit jobs from another remote machine.
Original issue reported on code.google.com by sudipt...@gmail.com on 22 Jul 2009 at 2:01