dportabella opened this issue 8 years ago
Does your application work if you launch it via spark-shell, along the lines of (on a local machine):
/home/i2millig/spark-1.5.1/bin/spark-shell --driver-memory 6G --jars ~/warcbase/warcbase-core/target/warcbase-core-0.1.0-SNAPSHOT-fatjar.jar
and then :paste your script in?
And does this simple script work (with the loadArchives path changed accordingly)?
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
val r = RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
.keepValidPages()
.map(r => ExtractDomain(r.getUrl))
.countItems()
.take(10)
Just curious where the error is.
Yes, this works, and gives this result:
r: Array[(String, Int)] = Array((www.archive.org,132), (deadlists.com,2), (www.hideout.com.br,1))
Does RecordLoader.loadArchives load everything in memory?
The example I gave works or fails depending on the input file (here /data/sample.warc.gz). I'll try to find the simplest input file for which it fails.
Yes, my understanding is that loadArchives loads everything in memory – down the road, I think we'd like to explore using CDX files to be a bit more selective (i.e. the ArchiveSpark model).
We've tested on WARC files up to ~1.1 GB, but not much bigger. Are you using a very big WARC file?
Error traces are always useful, as we're finding lots of weird things in WARC files that can occasionally break warcbase!
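If driver memory turns out to be the bottleneck, one possible workaround (just a sketch, assuming countItems() returns an ordinary RDD here) is to write the counts out instead of collecting them with take(), so the results never have to fit on the driver:

```scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._

// Same pipeline as the test script above, but saveAsTextFile writes
// the results from the executors to disk instead of pulling them into
// the driver with take(). Assumes countItems() yields a plain RDD.
RecordLoader.loadArchives("warcbase-core/src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .saveAsTextFile("/tmp/domain-counts")
```

This only helps with memory pressure on the driver side, of course; if loadArchives itself materializes the whole archive, the executors still need enough heap.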
It seems that it does not depend on the size of the WARC file: I can successfully process 1 GB CommonCrawl files and they work fine. Here is a sample WARC file of 295 MB for which loadArchives fails with OutOfMemoryError:
https://www.dropbox.com/s/h7ing7wdgdq1x9u/www.swisslog.com.warc.gz?dl=0
Or you can rebuild this WARC archive with:
wget --warc-file=www.swisslog.com --warc-max-size=500M --no-check-certificate --recursive --level=4 --reject pdf,gz,tar,zip,gif,js,css,ico,jpg,jpeg,png,tiff,mp3,mp4,mpg,mpeg,avi,rfa http://www.swisslog.com/
We do have problems with large WARC files, which I'll continue in #254.
This is weird, @dportabella – when you rebuild the WARC archive, does it work? Thanks!
I didn't understand the point of rebuilding the WARC archive and trying again (what insight would the result give?). Anyway, I tried, and it failed with the same error.
However, while rereading my own description, I noticed that I had only used --executor-memory 6g. I tried again with --driver-memory 6G, and this time the execution succeeded.
The input www.swisslog.com-00000.warc.gz is 265 MB, and 346 MB uncompressed. I tried again with --driver-memory 1G and it failed with the same error: OutOfMemoryError: Java heap space.
How can I know how much driver-memory and executor-memory I need?
Anyway, it was my mistake not to use --driver-memory. We can close this ticket.
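For anyone hitting the same thing, here is the shape of invocation that worked for me, with both memory flags set explicitly. The actual values needed are an open question (in my case, a 346 MB uncompressed archive succeeded with a 6G driver but failed with 1G, so I don't have a reliable rule of thumb):

```shell
# Both flags set explicitly; 6G worked for a 346 MB uncompressed
# archive in my tests, while a 1G driver did not.
spark-submit \
  --driver-memory 6G \
  --executor-memory 6g \
  --master local[2] \
  --class application.Test \
  target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar \
  /data/sample.warc.gz
```
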
I had memory problems running my program, and I see that I cannot even run this very simple example. I am running this on my local machine:
$ spark-submit --executor-memory 6g --master local[2] --class application.Test target/scala-2.10/test-assembly-0.1-SNAPSHOT.jar /data/sample.warc.gz
and I get a java.lang.OutOfMemoryError: Java heap space.
I thought that Spark would take care of the memory, spilling to disk when necessary, and run fine as long as there is enough disk space (even if the input file is 1 petabyte).
Why do I get an OutOfMemoryError? Does RecordLoader.loadArchives load everything in memory? How can I solve this problem?