Closed Toni-Chan closed 5 years ago
The problem is a multithreading bug in Intel MKL that originates in SparseSurvivalNllLayer.UpdateOutput(). To fix this, comment out all #pragma omp parallel for lines in sparse_survival_nll_layer.cpp. The model will then be able to run through testing to completion.
Thanks. Removing all the #pragma omp parallel for directives solved the problem.
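For anyone applying the same workaround, here is a minimal sketch of what commenting out such a directive looks like. The function SumOfSquares is a hypothetical stand-in written for illustration; it is not the actual code from sparse_survival_nll_layer.cpp, and the reduction clause shown in the comment is an assumption about how such a loop might have been parallelized.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop of the kind found in a layer's
// UpdateOutput(); the real layer code is not reproduced here.
double SumOfSquares(const std::vector<double>& values)
{
    double total = 0.0;
    // Workaround: the OpenMP directive is commented out, so the loop
    // below runs on a single thread.
    // #pragma omp parallel for reduction(+:total)
    for (std::size_t i = 0; i < values.size(); ++i)
        total += values[i] * values[i];
    return total;
}
```

With the directive disabled the loop is slower on large inputs but gives the same result, which is the trade-off accepted in this thread.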
In the first iteration of training, the program suddenly exits with no error message partway through the feed-forward computation. I was running run_small.sh after compiling with make, using the datasets supplied with the package. The program stops here:
main.cpp: mainloop(), lines 672-685
_nngraph.cpp (in graphnnbase): void NNGraph<mode, Dtype>::FeedForward(std::map<std::string, IMatrix<mode, Dtype>* > input, Phase phase), lines 23-50
_param_layer.h (in graphnnbase): class ParamLayer, virtual void UpdateOutput(std::vector< ILayer<mode, Dtype>* >& operands, Phase phase)
Symptoms
The exit happens in the very first iteration executed.
I have tracked the number of iterations that have been run, and it stops in the middle. The "Running Here +id" output is my own tracing, which shows where the program stops. It stops at the same place every time. It does not seem to be caused by the asserts. For dependencies, I am using Intel MKL with compilers and libraries version 2019.1.144 and Parallel Studio XE version 2019.1.053.
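For reference, tracing of this kind can be sketched as follows. The helper names TraceMessage and TraceHere are hypothetical, and the exact marker format is an assumption; the one important detail is that the output is flushed immediately, so the last marker still appears when the process exits without an error message.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper: builds the marker text for a given checkpoint id.
// The wording "Running Here <id>" is assumed for illustration.
std::string TraceMessage(int id)
{
    std::ostringstream out;
    out << "Running Here " << id;
    return out.str();
}

// Writes the marker to stderr. std::endl flushes the stream, so the
// marker is not lost in a buffer if the process dies right afterwards.
void TraceHere(int id)
{
    std::cerr << TraceMessage(id) << std::endl;
}
```

Placing a call such as TraceHere(1), TraceHere(2), and so on before each suspect statement narrows the exit down to the code after the last marker printed.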