lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/
161 stars 47 forks

java.lang.OutOfMemoryError: Java heap space #246

Open dportabella opened 8 years ago

dportabella commented 8 years ago

I had memory problems running my program, and I see that I cannot even run this very simple example:

package application

import org.apache.spark._
import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

object Test {
  def main(args: Array[String]): Unit = {
    val in = if (args.length > 0) args(0) else "/data/sample.warc.gz"
    val conf = new SparkConf().setAppName("Test")
    val spark = new SparkContext(conf)

    val r = RecordLoader.loadArchives(in, spark)
      .keepValidPages()
      .count()

    println(s"result: $r")

    spark.stop()
  }
}

I am running this on my local machine:

$ spark-submit --executor-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz

and I get a java.lang.OutOfMemoryError: Java heap space

I thought that Spark would take care of the memory, spilling to disk when necessary, and run fine as long as there is enough disk space (even if the input file is 1 petabyte).

Why do I get an OutOfMemoryError?

Does RecordLoader.loadArchives load everything in memory? How can I solve this problem?
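For intuition on the bounded-memory expectation: decompressing a gzip stream chunk by chunk keeps memory use proportional to the buffer size, not to the uncompressed size. A minimal JDK-only sketch (names are hypothetical; this does not use warcbase or Spark):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object StreamingGzipDemo {
  // Build a gzip payload in memory (stands in for a .warc.gz file).
  def gzip(data: Array[Byte]): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(data)
    gz.close()
    bos.toByteArray
  }

  // Decompress in fixed-size chunks: heap use is bounded by the
  // 8 KB buffer, no matter how large the uncompressed stream is.
  def countUncompressedBytes(compressed: Array[Byte]): Long = {
    val in = new GZIPInputStream(new ByteArrayInputStream(compressed))
    val buf = new Array[Byte](8192)
    var total = 0L
    var n = in.read(buf)
    while (n != -1) { total += n; n = in.read(buf) }
    in.close()
    total
  }

  def main(args: Array[String]): Unit = {
    val payload = Array.fill[Byte](1 << 20)('a'.toByte) // 1 MB of 'a'
    println(countUncompressedBytes(gzip(payload)))      // prints 1048576
  }
}
```

If a loader instead materializes each record (or the whole archive) as one object, the largest record must fit in heap regardless of how the stream is read.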

ianmilligan1 commented 8 years ago

Does your application work if you launch it via spark-shell? Along these lines, on a local machine:

/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar

Launch that, and then :paste your script in.

And does this simple script work (with loadArchives path changed accordingly):

import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

Just curious where the error is.

dportabella commented 8 years ago

Yes, this works, and gives this result: r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))

Does RecordLoader.loadArchives load everything in memory?

Whether the example I gave works depends on the input file; with /data/sample.warc.gz it fails. I'll try to find the simplest input file for which it fails.

ianmilligan1 commented 8 years ago

Yes, my understanding is that loadArchives loads everything in memory – down the road, I think we'd like to explore using CDX files to be a bit more selective (i.e. the ArchiveSpark model).

We've tested on WARC files up to ~ 1.1GB, but not much bigger. Are you using a very big WARC file?

Error traces are always useful, as we're finding lots of weird things in WARC files that can occasionally break warcbase!
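The CDX idea mentioned above amounts to reading only an index and then seeking to the (offset, length) of selected records instead of materializing everything. A rough sketch of the index-parsing half (the field order and names here are assumptions; real CDX formats vary):

```scala
// Hypothetical sketch of CDX-based selective loading: parse index lines,
// keep only the entries of interest, then seek into the WARC by offset.
object CdxSketch {
  // One line of a CDX index; field order assumed here:
  // urlkey timestamp url mime status digest length offset filename
  case class CdxEntry(url: String, mime: String, status: String,
                      length: Long, offset: Long, filename: String)

  def parse(line: String): CdxEntry = {
    val f = line.split(" ")
    CdxEntry(f(2), f(3), f(4), f(6).toLong, f(7).toLong, f(8))
  }

  def main(args: Array[String]): Unit = {
    val line = "org,archive)/ 20080430204825 http://www.archive.org/ " +
      "text/html 200 SHA1XYZ 1201 300 example.arc.gz"
    val e = parse(line)
    println(s"${e.offset} ${e.length}")  // prints "300 1201"
  }
}
```

With entries like this, only the matching byte ranges of the archive ever need to be read, which is the selectivity the ArchiveSpark model gets.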

dportabella commented 8 years ago

It seems that it does not depend on the size of the WARC file. I can successfully process 1 GB CommonCrawl files and it works fine. Here is a sample WARC file of 295 MB for which loadArchives fails with OutOfMemoryError:

https://www.dropbox.com/s/h7ing7wdgdq1x9u/www.swisslog.com.warc.gz?dl=0

Or you can rebuild this WARC archive with: wget --warc-file=www.swisslog.com --warc-max-size=500M --no-check-certificate --recursive --level=4 --reject pdf,gz,tar,zip,gif,js,css,ico,jpg,jpeg,png,tiff,mp3,mp4,mpg,mpeg,avi,rfa http://www.swisslog.com/

ianmilligan1 commented 7 years ago

We do have problems with large WARC files, which I'll follow up on in #254.

This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!

dportabella commented 7 years ago

I didn't understand the point of rebuilding the WARC archive and trying again (what insight would the result give?).

Anyway, I tried and it failed with the same error.

However, while reading my own description, I noticed that I only used --executor-memory 6g. I tried again using --driver-memory 6G, and this time the execution succeeded.

The input www.swisslog.com-00000.warc.gz is 265 MB compressed and 346 MB uncompressed. I tried again with --driver-memory 1G and it failed again with the same error: OutOfMemoryError: Java heap space.

How can I know how much driver-memory and executor-memory I need?
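One relevant detail: with --master local[2] everything runs in a single JVM, so --executor-memory has no effect and --driver-memory sets the only heap that matters, which would explain why only the latter helped. A quick way to see what heap the driver JVM actually got (a sketch; the object name is made up):

```scala
object HeapCheck {
  def main(args: Array[String]): Unit = {
    // Max heap of the current JVM; under spark-submit in local mode
    // this reflects the --driver-memory setting.
    val maxMb = Runtime.getRuntime.maxMemory / (1024 * 1024)
    println(s"max heap: $maxMb MB")
  }
}
```

Printing this at the start of the job makes it easy to confirm whether a memory flag was actually applied.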

Anyway, it was my mistake that I didn't use --driver-memory. We can close this ticket.