lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Spark job taking too long #170

Closed jrwiebe closed 8 years ago

jrwiebe commented 8 years ago

I've had a simple Spark job running for 7 hours now, when it should have completed in minutes. (I didn't realize quite how slowly it was going at first; later I became curious to see how long it would take.)

I ran the following from the Spark shell:

import org.warcbase.spark.matchbox.RecordLoader
import org.warcbase.spark.rdd.RecordRDD._

RecordLoader.loadArc("/collections/webarchives/CanadianPoliticalParties/arc", sc)
  .keepValidPages()                     // keep only valid HTML pages
  .keepDomains(Set("greenparty.ca"))    // restrict to a single domain
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl, r.getRawBodyContent))
  .saveAsTextFile("cpp.greenparty-arc-plaintext")

The monitoring web app says it's running on 2 executors, and that the "active stage" is saveAsTextFile. If I'm reading the stats correctly, almost all the run time has involved output.

Obviously something is going wrong here.

lintool commented 8 years ago

Has it run before? How much data is in the directory? Try extracting just (r.getCrawldate, r.getDomain, r.getUrl) as a test?
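That reduced extraction would just drop the body content from the map; a sketch reusing the pipeline above (output path hypothetical):

RecordLoader.loadArc("/collections/webarchives/CanadianPoliticalParties/arc", sc)
  .keepValidPages()
  .keepDomains(Set("greenparty.ca"))
  .map(r => (r.getCrawldate, r.getDomain, r.getUrl))  // no raw body content this time
  .saveAsTextFile("cpp.greenparty-test")              // hypothetical output path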

ianmilligan1 commented 8 years ago

I think Jeremy's running it on all the CPP ARCs, so ~ 180GB. I'll keep an eye on this - I was thinking of doing a test w/ a collection of about 45GB of WARCs on our test server once another job finishes.

lintool commented 8 years ago

2 executors is pretty skimpy on 180 GB :)

jrwiebe commented 8 years ago

My understanding is Spark automatically tries to set the number of partitions based on what's available in the cluster. Should I be setting this number manually in textFile()?
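For reference, plain Spark does let you request a partition count explicitly; a minimal sketch in the shell, with a hypothetical path and count:

val lines = sc.textFile("/some/input/path", 200)  // minPartitions: ask for at least 200 input splits
val spread = lines.repartition(200)               // or redistribute an existing RDD explicitly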


ianmilligan1 commented 8 years ago

This would be worth documenting – I've run the exact same script on our server at York University, and it completed much more quickly (in about an hour).

lintool commented 8 years ago

You can set the number of executors via the --num-executors flag when you start up the Spark shell. Check out:

http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

There are lots of articles on setting parameters if you search online...

ianmilligan1 commented 8 years ago

Interesting - reading up on this now.

So we could pass num-executors to spark-shell, or possibly enable spark.dynamicAllocation.enabled in the configuration file (or pass it at runtime; see the sketch below). Is there a pro or con to allowing dynamic allocation vs. manually setting the number of cores? I think if dynamic allocation sounds good, it'd be the easiest for end users.

This is from the part-2 link above, Jimmy.
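For example, dynamic allocation can be switched on at launch with --conf flags; a sketch, noting that on YARN it also requires the external shuffle service, and exact settings depend on the Spark version:

spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true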

lintool commented 8 years ago

Both have pros and cons. In a cluster with multiple users, you might want each user to set num-executors explicitly so that no single user hogs all the resources.

ruebot commented 8 years ago

Naive question: is num-executors tied directly to CPU cores (one executor per core), or is it something else?

jrwiebe commented 8 years ago

@ruebot: My understanding is that num-executors is the total number of executors distributed across the cluster, while executor-cores is the number of cores to use on each executor; I think you want to set the latter to something less than the number of cores per CPU.

Our cluster has 25 compute nodes each having two 6-core CPUs. I just tried running the same script as above with --executor-cores 5 --num-executors 75 and it took just 3 minutes instead of 7 hours. Interestingly, it was the last few (six?) tasks that took most of the last minute. I'm sure one could tune it better yet, but I'd say that's pretty good!

ruebot commented 8 years ago

@jrwiebe ah, good to know! Thanks for the explanation :smile:

lintool commented 8 years ago

I usually just do one core per executor and then set the number of executors to the number of cores. Typically, you'd allocate multiple cores per executor if you have a static resource (e.g., a large in-memory data file) that needs sharing.
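On the cluster described in this thread (25 nodes with two 6-core CPUs each, so 300 cores total), that rule of thumb would translate to something like:

spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 300 --executor-cores 1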

ianmilligan1 commented 8 years ago

Just to clarify - as I'd like to put this into the documentation - if, say, I'm on an 8-core EC2 instance, I could pass it as:

spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 8

If that sounds reasonable, I say we can close this and I'll update the documentation.

ianmilligan1 commented 8 years ago

@jrwiebe Just so this doesn't hang here too long: when you've got a chance, can you post the command you used to get this to complete in two or three minutes (as mentioned in person)? Then I'll document it and close.

jrwiebe commented 8 years ago

@ianmilligan1: I ran a few tests to get a better sense of how these parameters affect total execution time. Below are my results. The last two columns come from the Spark shell web UI; the reason "Executors" is sometimes one more than the specified number is that one executor is actually the driver, not a worker node.

num-executors   executor-cores   Executors   Duration
25              5                26          14 min
75              1                76          17 min
75              5                66          4.7 min
300             1                240         7.8 min
300             5                66          4.7 min
300             10               33          3.8 min

Different jobs will benefit differently from parallelization, so we can't consider these numbers ideal for all jobs. Also, sometimes --executor-memory might make a big difference.

I chose to go as high as 10 cores, since each of the 25 compute nodes on the cluster has 2 × 6 cores. This seems greedy if other people are trying to get work done, though, so I'd suggest a maximum closer to 5.

The command to run Warcbase through the Spark shell with these parameters is:

spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5
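If memory turns out to matter, --executor-memory can be added to the same invocation; the 4G value here is illustrative, not something that was tested:

spark-shell --jars target/warcbase-0.1.0-SNAPSHOT-fatjar.jar --num-executors 75 --executor-cores 5 --executor-memory 4G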

ianmilligan1 commented 8 years ago

This is fascinating - thanks very much @jrwiebe for this fantastic explanation and experimentation.

When you've got a chance, can you document in Installing and Running Spark under OS X – I think you could describe this better than I could. Just a brief note under "Running in the Spark Shell."

lintool commented 8 years ago

If I understand correctly, @jrwiebe's experiments were on Trantor... so it wouldn't really help in the general case, because most people don't have clusters... perhaps he can repeat the same experiments on the York server?

ianmilligan1 commented 8 years ago

Good point. I think @jrwiebe's laptop would be a good test bed too. If I find some free cycles this week, I'll run a test as well.

jrwiebe commented 8 years ago

These settings only have an effect on cluster setups, so there's no point in testing on my laptop. I don't have a copy of the data anyway. For the record, on the York machine it took 21 minutes.

ruebot commented 8 years ago

That explains the Nagios warnings :smile:

lintool commented 8 years ago

@jrwiebe do the num-executors and executor-cores settings have any effect on a single server?

jrwiebe commented 8 years ago

@lintool: They have no effect on a single server.

@ruebot: Oops!

@ianmilligan1: I don't think it would be useful to add any notes on this to the OS X instructions, since it seems highly unlikely that anyone would try to set up a cluster under OS X. I just added a page to the wiki, though, for users lucky enough to have access to a cluster. With that, I'll close this issue.

ruebot commented 8 years ago

@jrwiebe you're all good!

manisha803 commented 7 years ago

I am facing the same issue, and I did some research: it happens mainly when there is a huge number of files in the target folder. If I run the same job with a new target folder, it takes less than 5 minutes; if I run it in the regular target folder, where I have 2 years of history data, it takes 80 minutes.
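One common mitigation (a sketch, not from this thread; the RDD name and partition count are hypothetical) is to coalesce the result before writing, so each run produces far fewer output files:

results.coalesce(10).saveAsTextFile("fresh-output-dir")  // writes ~10 part files instead of hundreds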

manisha803 commented 7 years ago

I still couldn't find a solution. If anyone knows one, please share.