lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Dynamic PageRank Crashes #209

Closed: ianmilligan1 closed this issue 8 years ago

ianmilligan1 commented 8 years ago

I've been working on the graphx branch led by @jrwiebe, and am running into recurring errors when trying to calculate PageRank on the CPP collection.

The script I'm running is:

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.matchbox.ExtractGraph

// Load both the ARC and WARC portions of the collection.
val a = RecordLoader.loadArchives("/collections/webarchives/CanadianPoliticalParties/arc", sc)
val b = RecordLoader.loadArchives("/collections/webarchives/CanadianPoliticalParties/warc", sc)

// Combine them into a single RDD of records.
val recs = a.union(b)

// Extract the hyperlink graph and write it out as JSON.
val graph = ExtractGraph(recs)
graph.writeAsJson("nodes-cpp-all", "links-cpp-all")

The error trace can be found here. Note that it fails three times, each with a different error. Any thoughts?

jrwiebe commented 8 years ago

This appears to be an HDFS error, which might be solved by tuning timing settings. Based on all the slow/timeout messages, and some googling, I'd guess there's some issue with garbage collection. Looking at my code, the most costly operation would be the pageRank() call. I'm interested to see if replacing it with staticPageRank(3) allows the script to succeed. Not that this is an ideal solution.
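
For context, here's a minimal sketch of the difference between the two GraphX calls, assuming a plain org.apache.spark.graphx.Graph built from the extracted links (linkGraph and its vertex/edge property types are placeholders, not warcbase's actual ones):

import org.apache.spark.graphx.Graph

// linkGraph stands in for whatever Graph ExtractGraph builds internally.
def compareRanks(linkGraph: Graph[String, Int]): Unit = {
  // Dynamic PageRank: iterates until ranks change by less than the tolerance,
  // so the number of iterations (and the shuffle/GC pressure) is unbounded.
  val dynamicRanks = linkGraph.pageRank(tol = 0.001).vertices

  // Static PageRank: runs exactly numIter iterations (here 3) regardless of convergence.
  // Bounded work, which is why it might succeed where the dynamic version times out.
  val staticRanks = linkGraph.staticPageRank(numIter = 3).vertices

  println(s"dynamic: ${dynamicRanks.count()} ranks, static: ${staticRanks.count()} ranks")
}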

ianmilligan1 commented 8 years ago

Pinging @lintool for thoughts re: HDFS error?

ianmilligan1 commented 8 years ago

Looping in @yb1. I will show an example script for PageRank on GeoCities (I just need to double-check how subsetting out a smaller collection works).

Here's a script that should work, although I think working with CanadianPoliticalParties as in the original post might be an easier way to troubleshoot.

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.matchbox.ExtractGraph

// Load the GeoCities WARCs and keep only the EnchantedForest neighbourhood.
val recs = RecordLoader.loadArchives("/collections/webarchives/geocities/warcs/", sc)
  .keepUrlPatterns(Set("http://geocities.com/EnchantedForest/.*".r))

// Extract the hyperlink graph and write it out as JSON.
val graph = ExtractGraph(recs)
graph.writeAsJson("nodes-cpp-all", "links-cpp-all")

youngbink commented 8 years ago

Hi @ianmilligan1, this issue has been fixed in Spark 1.6.1. Please use Spark 1.6.1 for PageRank.

I was able to successfully run the scripts below.

  1. The script at the top of this issue (I ran it about 3 times).
  2. CanadianPoliticalParties with the number of iterations set to 20 (the default used to be 3): val graph = ExtractGraph(recs, numIter = 20)
  3. Dynamic PageRank: val graph = ExtractGraph(recs, true, 0.02) (Reference: here. The last parameter is the tolerance; a lower value can give a more accurate result but takes more time to converge.)
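
For reference, here are those three runs side by side as they would appear in the script above; the calls are exactly the ones listed, and nothing beyond them is assumed about ExtractGraph's signature:

// 1. Default extraction, as in the original script.
val graphDefault = ExtractGraph(recs)

// 2. Static PageRank with 20 iterations instead of the old default of 3.
val graphMoreIters = ExtractGraph(recs, numIter = 20)

// 3. Dynamic PageRank with a tolerance of 0.02; a smaller tolerance gives a more
//    accurate result but takes longer to converge.
val graphDynamic = ExtractGraph(recs, true, 0.02)
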
ianmilligan1 commented 8 years ago

Will test this using Spark 1.6.1 (decided to kill the K-Means after thousands of failures). Will let you know!

@yb1: stupid question. How did you get Spark 1.6.1 to work with HDFS? What command did you use to launch it?

youngbink commented 8 years ago

Oh sorry @ianmilligan1, I just saw your last comment.

I added the following flags: --master yarn-client --driver-library-path /opt/cloudera/parcels/CDH/lib/hadoop/lib/native

The full command I used was:

~/spark-1.6.1-bin-hadoop2.6/bin/spark-shell --jars ~/warcbase2/target/warcbase-0.2.2-SNAPSHOT-fatjar.jar --num-executors 5 --executor-cores 5 --executor-memory 10G --driver-memory 10G --driver-library-path /opt/cloudera/parcels/CDH/lib/hadoop/lib/native --master yarn-client

I also added the following line to ~/.bash_profile to set the environment variable:

export HADOOP_CONF_DIR=/etc/hadoop/conf/

Then ran source ~/.bash_profile to reload it.

ianmilligan1 commented 8 years ago

Great, thanks @yb1!