Update:
Having a closer look at the tasks in the Spark UI, I can see that the number of records being processed varies a lot: from 100 to 260k. It is exactly the tasks with lots of examples that seem to take forever during training. Is there a way to get the examples distributed more evenly? I saw that you call coalesce. Maybe I should repartition the training data in Spark myself before training, using the same number of partitions I set for the SparkAsyncDL model?
Ok, repartitioning as mentioned above fixes the uneven distribution of examples across the workers. Leaving this here in case others run into the same issue.
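A minimal sketch of the fix, for anyone else hitting this (`train_df` and `spark_model` are placeholder names, and 30 is just an example; use the same partition count you pass to SparkAsyncDL):

```python
# Repartition the training data so each worker gets a roughly even share of examples.
# 30 is a placeholder; match it to the `partitions` value used for the SparkAsyncDL model.
train_df = train_df.repartition(30)

# Fit as usual; SparkAsyncDL is a Spark ML estimator, so fit() returns a fitted model.
fitted_model = spark_model.fit(train_df)
```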
Hi all,
I am using Spark 2.4.0 and the current version of SparkFlow with the current TensorFlow Python package. I have successfully run the MNIST example, but I seem to have problems training a SparkFlow model on another dataset.
Sometimes tasks seem to hang endlessly (no worker finishes for hours) while others finish fast. The last lines of the logs of a hanging worker look like this:
I already start the context with 30 executors, 15 GB of executor memory and 10 GB of executor memory overhead to process a dataset of approx. 100 MB. The model uses the following parameters:
I experimented with different settings for acquire locks, partition shuffles and mini-batches. Not all of them produced errors, but partition shuffles seemed to have a good impact on model performance for MNIST, so I wanted to keep them. And when I use mini-batches, I can see the loss decreasing over later iterations, in contrast to not using them, where the loss fluctuates much more.
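For reference, a rough sketch of how those options get set on SparkAsyncDL. The parameter names (`acquireLock`, `partitionShuffles`, `miniBatchSize`, ...) follow the SparkFlow README as I recall it and may differ by version, and all values, column names and the tiny graph are placeholders, not my actual configuration:

```python
import tensorflow as tf
from sparkflow.graph_utils import build_graph
from sparkflow.tensorflow_async import SparkAsyncDL

def small_model():
    # Placeholder TF 1.x graph; replace with the real network.
    x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
    y = tf.placeholder(tf.float32, shape=[None, 10], name='y')
    layer1 = tf.layers.dense(x, 256, activation=tf.nn.relu)
    out = tf.layers.dense(layer1, 10)
    tf.argmax(out, 1, name='out')
    return tf.losses.softmax_cross_entropy(y, out)

mg = build_graph(small_model)

# Sketch of the estimator configuration; check the option names against the
# SparkAsyncDL signature in your installed SparkFlow version.
spark_model = SparkAsyncDL(
    inputCol='features',
    tensorflowGraph=mg,
    tfInput='x:0',
    tfLabel='y:0',
    tfOutput='out:0',
    tfLearningRate=0.001,
    iters=50,
    partitions=30,           # same count used when repartitioning the training data
    miniBatchSize=256,       # mini-batch setting mentioned above
    partitionShuffles=2,     # shuffle data across partitions between passes
    acquireLock=True,        # lock weights on the parameter server during updates
    predictionCol='predicted',
    labelCol='labels',
    verbose=1,
)
```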
Do you have any ideas what the reasons could be? Would you recommend downgrading Spark to 2.3 and TF to 1.7, or are there other possibilities that I am missing?
Best,