neoml-lib / neoml

Machine learning framework for both deep learning and traditional algorithms
https://www.abbyy.com/neoml/
Apache License 2.0
764 stars · 126 forks

[NeoML] DistributedTraining uses IsDnnInferenced #1110

Open favorart opened 1 week ago

favorart commented 1 week ago

Previously, if you did a RunOnce (even on random data) before a RunAndBackward, it no longer counted as the firstRun, and you could then send batches as you wished; as a result, some dnn could end up never being trained. Now you cannot do that.
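The old firstRun behavior can be sketched as a toy model (all names here are hypothetical, not the NeoML API): a single global flag is cleared by any run, including an inference-only RunOnce on random data, after which uneven batch layouts are accepted.

```cpp
#include <cassert>
#include <vector>

// Toy model of the old firstRun semantics (hypothetical names, not NeoML code).
struct OldDistributed {
    bool firstRun = true;

    // Inference-only pass: clears firstRun even when run on random data.
    void RunOnce() { firstRun = false; }

    // Returns true if this batch layout is accepted for a training step.
    // On the first run every replica must receive a batch; afterwards,
    // batches may be sent "as you wish".
    bool RunAndBackward( const std::vector<bool>& replicaHasBatch ) {
        if( firstRun ) {
            for( bool has : replicaHasBatch ) {
                if( !has ) {
                    return false; // rejected: a replica is missing its batch
                }
            }
            firstRun = false;
        }
        return true;
    }
};
```

The point of the sketch: a RunOnce "unlocks" uneven batch distribution without ever training the skipped replicas, which is exactly the loophole the IsDnnInferenced check addresses.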

All dnns must have their paramBlobs initialized before solver->Train() can run for all of them (at least one RunOnce must be completed for each dnn for this to happen).

The solver->Train() must run for all dnns, because all dnns must hold the same paramBlobs in every epoch.
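The invariant behind this requirement can be shown with a minimal simulation (again a toy, not NeoML code): as long as every replica applies the same update each step, the replicas' parameters stay identical; skip one replica (or give it an extra step) and they diverge.

```cpp
#include <cassert>
#include <vector>

// Toy replica with a single scalar parameter (stand-in for paramBlobs).
struct Replica {
    double param = 1.0;
    void Train( double grad ) { param -= 0.1 * grad; } // simple SGD step
};

// Apply the same (e.g. averaged) gradient to every replica.
void TrainAll( std::vector<Replica>& replicas, double grad )
{
    for( Replica& r : replicas ) {
        r.Train( grad );
    }
}

// Check the invariant: all replicas hold the same parameters.
bool ParamsIdentical( const std::vector<Replica>& replicas )
{
    for( const Replica& r : replicas ) {
        if( r.param != replicas.front().param ) {
            return false;
        }
    }
    return true;
}
```

This is why Train must run for every dnn each epoch: one skipped or extra update on any single replica breaks the identical-paramBlobs invariant.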