linkedin / photon-ml

A scalable machine learning library on Apache Spark

Can't run a GAME #435

Closed sverdrup-999 closed 4 years ago

sverdrup-999 commented 4 years ago

Firstly, thanks for making this! Excited to start using it.

I am attempting to train a model on a Spark (2.3) cluster as below:

spark-submit \
  --class com.linkedin.photon.ml.cli.game.training.GameTrainingDriver \
  --master yarn \
  --num-executors 49 \
  --executor-cores 7 \
  --driver-memory 35G \
  --conf "spark.driver.cores=7" \
  --executor-memory 35G \
  "/tmp/jars/photon-all_2.11-1.0.0.jar" \
  --input-data-directories "gs://some-random-bucket/training/" \
  --validation-data-directories "gs://some-random-bucket/validation/" \
  --override-output-directory true \
  --root-output-directory "gs://some-random-bucket/output/" \
  --feature-shard-configurations "name=globalShard,feature.bags=from_features|to_features" \
  --feature-shard-configurations "name=toShard,feature.bags=to_features" \
  --coordinate-configurations "name=global,feature.shard=globalShard,min.partitions=4,optimizer=LBFGS,tolerance=1.0E-4,max.iter=10,regularization=L2,reg.weights=0.1|1|10|100" \
  --coordinate-configurations "name=from,random.effect.type=from_id,feature.shard=toShard,min.partitions=4,optimizer=LBFGS,tolerance=1.0E-4,max.iter=10,regularization=L2,reg.weights=0.1|1|10|100" \
  --coordinate-update-sequence "global,from" \
  --coordinate-descent-iterations 1 \
  --training-task "LOGISTIC_REGRESSION"

The issue is that the Spark job hangs at a step after reading all the partitions, usually at one of the 'count' operations. Cluster CPU usage drops to almost 0% at that point.

Examples:

Step 9:
count at RandomEffectDataset.scala:282
(kill) count at RandomEffectDataset.scala:282

----
Step 46:
count at RandomEffectModel.scala:178
(kill) count at RandomEffectModel.scala:178

Eventually, the Spark app just exits and is marked as 'Finished'.

Oddly, when I reduce the size of the data (say to 1/100th of the full set), the job runs to completion. The total memory of the cluster is about 2-3x the size of the data to be processed; I also tried with about 10x and it didn't work either.

The training/validation data is in Avro format with structure mimicking the one in the tutorial.

Please help me resolve this issue.

Side note: I tried to work off the tutorial example and use the Scala API directly on the cluster, but kept running into errors about missing classes, methods, etc.

ashelkovnykov commented 4 years ago

Hello - thanks for giving Photon ML a try.

Yes, using the API directly can be challenging - Photon definitely prioritizes Spark applications as opposed to interactive shell sessions.

With regard to your current issue: are you running a Spark history server? Is there a way you can check which particular Spark task is hanging and what its threads are doing at that time? How big is the dataset that hits this issue? You mention that the application succeeds with 1% of the data; is that the limit? What % of the data can be handled reliably before the issue appears?
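If event logging isn't already enabled, adding something along these lines to your spark-submit command will let a history server show the per-job and per-task breakdown after the application exits (the log directory below is just a placeholder; point the history server's spark.history.fs.logDirectory at the same path):

  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=gs://some-random-bucket/spark-events/" \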

sverdrup-999 commented 4 years ago

@ashelkovnykov thanks for taking a look. I'll stick to running it via spark-submit!

Regarding the issue: I was able to get 50% of the data processed by raising the driver and executor memory overheads above the (Avro) size of the data. I haven't managed a full-data run with similarly adjusted settings yet; the full dataset is about 25GB (in Avro).
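For reference, the overheads were raised via the standard Spark settings; the values below are only illustrative, not the exact ones used:

  # illustrative values only - set the overheads to exceed the Avro data size
  --conf "spark.driver.memoryOverhead=30G" \
  --conf "spark.executor.memoryOverhead=30G" \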

The job always hangs at one of the 'collect' or 'count' jobs in RandomEffectModel.scala. The thread dump doesn't show much, but GC time seems to go up when it hangs. I don't see any memory-related errors in the logs, though.

Any insights will be appreciated.

sverdrup-999 commented 4 years ago

It would also be helpful to have your guidance on cluster size, settings, and practices that have worked before for data anywhere from 25-250GB, both to get a model build to finish and to make the builds run faster.

ashelkovnykov commented 4 years ago

@sverdrup-999

I'm not sure what help I can offer - without seeing the Spark logs or the application history, I don't know what to look for. I notice that you have min.partitions set to 4 for both coordinates - this could cause trouble depending on how the data is distributed across files. Try setting both to 2500 or so.
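Concretely, that would mean changing only the min.partitions value in the two coordinate configurations from your original command, e.g.:

  --coordinate-configurations "name=global,feature.shard=globalShard,min.partitions=2500,optimizer=LBFGS,tolerance=1.0E-4,max.iter=10,regularization=L2,reg.weights=0.1|1|10|100" \
  --coordinate-configurations "name=from,random.effect.type=from_id,feature.shard=toShard,min.partitions=2500,optimizer=LBFGS,tolerance=1.0E-4,max.iter=10,regularization=L2,reg.weights=0.1|1|10|100" \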

Is your "from" random effect data skewed, i.e. does one ID have vastly more data than the others?

With regard to cluster settings & practices: I don't maintain the cluster at LinkedIn, so I can't speak on this matter. It really depends on how many simultaneous applications you're going to be running, what kinds of applications, how many users you're sharing the cluster with, whether you own/manage the cluster machines, whether the data is co-located, etc. If you have specific questions, I can do my best to answer them.

sverdrup-999 commented 4 years ago

@ashelkovnykov upping min.partitions did the trick! Thank you!