Open dmmiller612 opened 6 years ago
I would love that. It seems to me that not running the computations in parallel is the reason why models could underperform: some partitions finish last and therefore have the biggest impact on the final model (https://www.youtube.com/watch?v=nNrdv45O3pE at 15:00). At least that is what I understood when listening to this interesting talk.
BTW: should all partitions of the training data be the same size? And are there any guarantees on how close the model performance of this async training is to that of normal training, or other estimates that help me get a grasp on the impact of this execution model? For example: given the same initial weights, would different model-building runs on the same data lead to very different final weights just because other processes running on the workers cause the partitions to finish in a different order?
Edit: OK, I found the paper: https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent.pdf
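To make my question about order dependence concrete, here is a toy, self-contained sketch (not SparkFlow code, just an illustration of the Hogwild!-style idea from the paper): several workers apply SGD updates to a shared weight vector without any locking, so the final weights depend on how the updates happen to interleave, which is roughly the effect I am asking about.

```python
# Toy Hogwild!-style simulation: lock-free SGD updates to a shared weight vector.
# All names here are illustrative; this is not how SparkFlow implements training.
import threading
import numpy as np

def make_data(n=1000, d=5, seed=0):
    rng = np.random.RandomState(seed)
    X = rng.randn(n, d)
    true_w = rng.randn(d)
    y = X @ true_w + 0.01 * rng.randn(n)
    return X, y

def worker(w, X, y, lr=0.01, epochs=5):
    # Plain SGD steps applied in place to the shared vector `w`, with no locks.
    for _ in range(epochs):
        for i in range(len(y)):
            grad = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad  # races with updates from the other workers

X, y = make_data()
w = np.zeros(X.shape[1])
# One "partition" of the data per worker thread.
splits = np.array_split(np.arange(len(y)), 4)
threads = [threading.Thread(target=worker, args=(w, X[idx], y[idx])) for idx in splits]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned weights:", w)
```

Running this repeatedly gives slightly different final weights because the interleaving of the lock-free updates differs from run to run, even though the data and initial weights are identical.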
PS: I love your API and the fact that you decided to make it work so seamlessly with Spark pipelines.
With Spark 2.4.0, Barrier Executors (barrier execution mode) were added to ensure that all tasks in a stage run at the same time. We should add this for training in SparkFlow.
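For reference, a minimal PySpark sketch of what barrier execution mode looks like; `train_partition`, the toy RDD, and the placeholder training step are illustrative only and do not reflect how SparkFlow would wire this in:

```python
# Minimal sketch of Spark 2.4 barrier execution mode (illustrative only).
from pyspark.sql import SparkSession
from pyspark import BarrierTaskContext

def train_partition(iterator):
    # In a barrier stage, all tasks are scheduled together; barrier() blocks
    # until every task in the stage has reached this point.
    context = BarrierTaskContext.get()
    context.barrier()
    data = list(iterator)
    # ... run the distributed training step on `data` here (placeholder) ...
    yield len(data)

spark = SparkSession.builder.appName("barrier-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1000), numSlices=4)
# barrier() marks the stage as a barrier stage: either all 4 tasks launch
# simultaneously, or the stage waits until enough slots are free.
counts = rdd.barrier().mapPartitions(train_partition).collect()
print(counts)
```

This would address the concern above about partitions finishing at different times, since the barrier stage only runs once all tasks can start together.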