Closed sverdrup-999 closed 4 years ago
Hello - thanks for giving Photon ML a try.
Yes, using the API directly can be challenging - Photon ML definitely prioritizes Spark applications over interactive shell sessions.
With regards to your current issue: Are you running a Spark history server? Is there a way you can check which Spark task is hanging and what its threads are doing at that time? How big is the dataset that hits this issue? You mention that the application succeeds with 1% of the data. Is that the limit? What percentage of the data can be handled reliably before the issue appears?
@ashelkovnykov thanks for taking a look. I'll stick to running it via spark-submit!
Regarding the issue, I was able to get 50% of the data processed by adjusting the driver and executor memory overheads to be greater than the (Avro) size of the data. I haven't managed a full-data run with similarly adjusted settings yet. The full dataset is about 25 GB (in Avro).
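For reference, the overhead adjustment above can be sketched as spark-submit flags; the specific values here are assumptions sized against the ~25 GB input, not recommendations from the Photon ML docs, and would need tuning for your cluster:

```shell
# Sketch: raise memory overhead above the on-disk (Avro) data size per JVM.
# spark.{driver,executor}.memoryOverhead are standard Spark 2.3+ settings;
# the sizes below are illustrative assumptions only.
spark-submit \
  --conf spark.driver.memoryOverhead=8g \
  --conf spark.executor.memoryOverhead=8g \
  --conf spark.driver.memory=16g \
  --conf spark.executor.memory=16g \
  # ... remaining Photon ML arguments unchanged
```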
The task always hangs at one of the 'collect' or 'count' jobs in RandomEffectModel.scala. The thread dump doesn't show much, but GC time seems to climb when it hangs. I don't see any memory-related errors in the logs, though.
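To confirm whether the hang coincides with full GCs, GC activity can be surfaced in the executor logs. A sketch of the relevant spark-submit flags, assuming a Java 8 JVM (these are standard HotSpot options, not Photon ML settings):

```shell
# Sketch: enable verbose GC logging on driver and executors (Java 8 flags).
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails" \
  # ... remaining arguments unchanged
```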
Any insights will be appreciated.
It would also be helpful to have your guidance on cluster sizes, settings, and practices that have worked before for data anywhere from 25-250 GB, both to get a model build to finish and to make the builds run faster.
@sverdrup-999
I'm not sure what help I can offer - without seeing the Spark logs or the application history, I don't know what to look for. I notice that you have 'min.partitions' set to 4 for both coordinates - this could cause trouble depending on how the data is distributed across files. Try setting both to 2500 or so.
Is your "from" random effect data skewed, i.e. does one ID have vastly more data than the others?
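One way to quantify that kind of skew is the ratio of the largest per-ID record count to the mean. A minimal sketch, using a plain Map of hypothetical per-ID counts; on a real dataset you would obtain the same counts with something like `data.map(r => (r.entityId, 1L)).reduceByKey(_ + _)` in Spark (the field name `entityId` is an assumption):

```scala
object SkewCheck {
  // Ratio of the largest per-ID count to the mean count.
  // A value near 1.0 means balanced data; a large value means one ID dominates.
  def skewRatio(countsById: Map[String, Long]): Double = {
    require(countsById.nonEmpty, "need at least one ID")
    val max = countsById.values.max.toDouble
    val mean = countsById.values.sum.toDouble / countsById.size
    max / mean
  }
}
```

If the ratio is very large, the partition holding the dominant ID can stall an otherwise idle stage, which would match the near-0% CPU symptom.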
With regards to cluster settings & practices: I don't maintain the cluster at LinkedIn, so I can't speak on this matter in detail. It really depends on how many simultaneous applications you're going to run, what kinds of applications, how many users will share the cluster, whether you own/manage the cluster machines, whether the data is co-located, etc. If you have specific questions, I'll do my best to answer them.
@ashelkovnykov upping the min.partitions
did the trick! Thank you!
Firstly, thanks for making this! Excited to start using it.
I am attempting to train a model on a Spark (2.3) cluster as below:
The issue is that the Spark job hangs at a step after reading all the partitions. It tends to do so at one of the 'count' operations, and cluster CPU usage drops to almost 0% at that point.
Examples:
Eventually, the Spark app just exits and is marked as 'Finished'.
Oddly, when I reduce the size of the data (say, to 1/100th of the full set), the run completes. The total memory of the cluster is about 2-3x the size of the total data to be processed; I tried 10x and it didn't work either.
The training/validation data is in Avro format with structure mimicking the one in the tutorial.
Please help me resolve this issue.
As a side note, I tried to work off the tutorial example and run Scala code directly on the cluster, but kept hitting errors about missing classes, methods, etc.