hack-er opened this issue 7 years ago (status: Open)
Hi @hack-er, RDDs can be used without HDFS; HDFS is optional for running any of the NOUS components. This does, however, look like an issue with the driver broadcasting a distributed object. I have opened a bug for this.
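For anyone hitting the same Spark error: it refers to passing an RDD directly to `SparkContext.broadcast`. Below is a minimal, illustrative Scala sketch of the failing pattern and the collect-then-broadcast workaround that the error message suggests; the `patterns` RDD is made up for illustration and is not the actual NOUS code.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // Hypothetical distributed data; stands in for whatever the driver tries to broadcast.
    val patterns = sc.parallelize(Seq(("p1", 3L), ("p2", 5L)))

    // sc.broadcast(patterns)
    // ^ this is what Spark forbids: "Can not directly broadcast RDDs; instead, call collect() ..."

    // The workaround Spark suggests: materialize on the driver, then broadcast the local copy.
    val localPatterns = patterns.collect().toMap
    val patternsBroadcast = sc.broadcast(localPatterns)

    println(patternsBroadcast.value)
    sc.stop()
  }
}
```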
Any update? I've been tinkering with it but haven't managed to get this working.
Commit 331b66d32d95341696e9ce8d1c6e57df9d983814 addresses the broadcast issue, but the error you mentioned may be a side effect of another issue. To make sure the error is not in the graph-construction phase, can you first try running the example files successfully? All the files in the "/Mining/examples/wsj/intGraph/" directory are good files to test with.
Sorry for coming back so much later, had another project I had to switch to.
I went ahead and tested with the files in that directory, and it works as far as giving me frequency data, but it writes a blank dependencyGraph.txt file. Is this normal behavior? I'm also unable to get any output from anything other than the test data. Is there any documentation on what each of the variables in knowledge_graph.conf does? Any help would be greatly appreciated!
I've successfully built and used the TripleExtractor module to produce some output.
Upon moving to the 'Mining' module, I had to use -DskipTests to get the build to finish successfully, and now I'm coming across the following error when trying to create a graph from the triples produced.
```
/usr/lib/spark/bin/spark-submit --verbose --jars /root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar --class "gov.pnnl.aristotle.algorithms.DataToPatternGraph" /root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar /root/NOUS/Mining/conf/knowledge_graph.conf

Using properties file: null
Parsed arguments:
  master                  local[*]
  deployMode              null
  executorMemory          null
  executorCores           null
  totalExecutorCores      null
  propertiesFile          null
  driverMemory            null
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            null
  files                   null
  pyFiles                 null
  archives                null
  mainClass               gov.pnnl.aristotle.algorithms.DataToPatternGraph
  primaryResource         file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar
  name                    gov.pnnl.aristotle.algorithms.DataToPatternGraph
  childArgs               [/root/NOUS/Mining/conf/knowledge_graph.conf]
  jars                    file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through --conf and those from the properties file null:

Main class:
gov.pnnl.aristotle.algorithms.DataToPatternGraph
Arguments:
/root/NOUS/Mining/conf/knowledge_graph.conf
System properties:
SPARK_SUBMIT -> true
spark.app.name -> gov.pnnl.aristotle.algorithms.DataToPatternGraph
spark.jars -> file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar,file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar
spark.submit.deployMode -> client
spark.master -> local[*]
Classpath elements:
file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar
file:/root/NOUS/Mining/target/uber-graphmining-1.0-SNAPSHOT.jar

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
{run=[{batchInfoFilePathbkp=[/root/NOUS/Mining/conf/inputfile.list], batchInfoFilePath=[/root/NOUS/Mining/conf/inputfiles.list], outDir=[.], typeEdge=[0], isoSupport=[1], misSupport=[2], startTime=[2010], batchSizeInTime=[1y], windowSizeInBatch=[3], maxPatternSize=[2], supportScallingFactor=[1000], debugId=[0]}], output=[{frqPatternFilePath=[./output/frequentPatterns.tsv], frqPatternPerBatchFilePath=[./output/frequentPatternsPerBatch.tsv], depGraphFilePath=[./output/dependencyGraph.txt]}]}
(**Before reading file, base currentBatchId is ,39)
starting map phase1
starting map phase3
> Building graph
NOUS_RUN_START&bid=40&outstring=all frequent pattern of size 1 0&NOUS_RUN_END
(iteration ,1,finding 2^1 max-size pattern)
(in join : current batch id ,40)
NOUS_RUN_START&bid=40&outstring=all frequent pattern found with count in joins 0&NOUS_RUN_END
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Can not directly broadcast RDDs; instead, call collect() and broadcast the result.
	at scala.Predef$.require(Predef.scala:224)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1409)
	at gov.pnnl.aristotle.algorithms.DataToPatternGraph$$anonfun$main$2$$anonfun$apply$1.apply$mcV$sp(DatatoPatternGraph.scala:457)
	at scala.util.control.Breaks.breakable(Breaks.scala:38)
	at gov.pnnl.aristotle.algorithms.DataToPatternGraph$$anonfun$main$2.apply(DatatoPatternGraph.scala:396)
	at gov.pnnl.aristotle.algorithms.DataToPatternGraph$$anonfun$main$2.apply(DatatoPatternGraph.scala:212)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at gov.pnnl.aristotle.algorithms.DataToPatternGraph$.main(DatatoPatternGraph.scala:212)
	at gov.pnnl.aristotle.algorithms.DataToPatternGraph.main(DatatoPatternGraph.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
My knowledge_graph.conf is as follows:
```
[run]
batchInfoFilePathbkp = /root/NOUS/Mining/conf/inputfile.list
batchInfoFilePath = /root/NOUS/Mining/conf/inputfiles.list
outDir = .
typeEdge = 0
isoSupport = 1
misSupport = 2
startTime = 2010
batchSizeInTime = 1y
windowSizeInBatch = 3
maxPatternSize = 2
supportScallingFactor = 1000
debugId = 0

[output]
frqPatternFilePath = ./output/frequentPatterns.tsv
frqPatternPerBatchFilePath = ./output/frequentPatternsPerBatch.tsv
depGraphFilePath = ./output/dependencyGraph.txt
```
My inputfile.list is as follows:
```
/root/NOUS/TripleExtractor/output1
```
I'm really not sure what I'm doing wrong; please advise. Sorry if this turns out to be something really simple.
From what I've googled, it seems RDDs are a feature of HDFS. Is HDFS now a requirement for running NOUS? The main page says it is optional.
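For reference, RDDs are a Spark abstraction and can be backed by the local filesystem rather than HDFS. A minimal Scala sketch, assuming a `local[*]` master and reusing the TripleExtractor output path from the inputfile.list above purely for illustration (this is not the NOUS code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalRddSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark entirely in-process: no cluster manager and no HDFS involved.
    val sc = new SparkContext(new SparkConf().setAppName("local-rdd-sketch").setMaster("local[*]"))

    // A file:// URI reads from the local filesystem; the path mirrors the inputfile.list above.
    val triples = sc.textFile("file:///root/NOUS/TripleExtractor/output1")

    println(s"lines read: ${triples.count()}")
    sc.stop()
  }
}
```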