carabolic closed this issue 8 years ago
@FelixNeutatz Is this solved now, with the adjusted akka.framesize?
Yes, there were a couple of things that needed to be tweaked in order to get the job running on the cluster. I've added a Notes section to the issue description.
We tried to run the LinRegDS.dml script on the cluster (cloud-11) in flink_hybrid mode with an 80GB training matrix, but the job fails due to an akka timeout.
Full Exception
Reproduce the Bug
The test data is already in HDFS: the training matrix at /user/cbruecke/systemml/linRegData_80G.train.bin and the labels at /user/cbruecke/systemml/linRegData_80G.labels.bin.
YARN and HDFS must be running at all times.
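Before submitting the job, the two preconditions above (inputs present in HDFS, YARN and HDFS up) can be checked with a small pre-flight script. This is only a sketch: `check_inputs` is a hypothetical helper name, not part of SystemML or Hadoop, and it assumes the `hdfs` and `yarn` command-line clients are on the PATH of the submitting machine.

```shell
# Pre-flight check: do the given HDFS paths exist, and does YARN answer?
# check_inputs is a hypothetical helper, not part of SystemML or Hadoop.
check_inputs() {
  local path failed=0
  for path in "$@"; do
    # `hdfs dfs -test -e` exits 0 only if the path exists in HDFS
    if hdfs dfs -test -e "$path"; then
      echo "found: $path"
    else
      echo "missing: $path"
      failed=1
    fi
  done
  # `yarn node -list` exits non-zero if it cannot get a node report from
  # the ResourceManager (the client may retry for a while before giving up)
  if ! yarn node -list >/dev/null 2>&1; then
    echo "yarn: ResourceManager not reachable"
    failed=1
  fi
  return "$failed"
}
```

For this issue it would be called as `check_inputs /user/cbruecke/systemml/linRegData_80G.train.bin /user/cbruecke/systemml/linRegData_80G.labels.bin`; a non-zero exit status means at least one precondition is not met.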
Configuration
Mainly copied from the SystemML guide.
in mapred-site.xml
in yarn-site.xml
Data Generation
Data Preparation
Running in FLINK_HYBRID Mode
This is the phase that actually fails. The first phase, which includes TSMM, succeeded; only the subsequent Flink job, which includes MAPMM, cannot be submitted due to the akka timeout.
Notes
hadoop.version set to 2.7.1
hadoop-mapreduce-client-jobclient.jar in the classpath
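Since the thread attributes the fix to an adjusted akka.framesize, it can help to confirm which akka settings the Flink client actually picks up. The sketch below prints the uncommented akka overrides from a flink-conf.yaml. `show_akka_settings` is a hypothetical helper name, and including `akka.ask.timeout` is an assumption: the issue only says "akka timeout" and does not name the key that governs it.

```shell
# Print the active akka.framesize / akka.ask.timeout lines of a Flink config.
# show_akka_settings is a hypothetical helper; pass the path to flink-conf.yaml.
show_akka_settings() {
  local conf="$1"
  # Match only uncommented keys; commented-out lines start with '#'
  grep -E '^[[:space:]]*akka\.(framesize|ask\.timeout)[[:space:]]*:' "$conf" \
    || echo "no akka.framesize or akka.ask.timeout override in $conf (Flink defaults apply)"
}
```

A typical call would be `show_akka_settings "$FLINK_CONF_DIR/flink-conf.yaml"`, where the location of the configuration directory depends on the installation.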