humphd / have-fun-with-machine-learning

An absolute beginner's guide to Machine Learning and Image Classification with Neural Networks

error -9 when training caffe-alexnet model #17

Closed srinivasmangipudi closed 6 years ago

srinivasmangipudi commented 6 years ago

The job ran for about 2 minutes, but at process #60 it crashes with the following error. See the attached image.

[Image: screenshot, 2018-03-21 at 5:44:41 PM]

humphd commented 6 years ago

I'm not entirely sure, but my guess is that either your system is running out of memory or it is picking incorrectly between CPU and GPU.

Could be https://github.com/NVIDIA/DIGITS/issues/1402 ?
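
For context on the out-of-memory guess, here is a minimal sketch of how a negative exit code can be read. It assumes the "-9" in the error is reported the way Python's `subprocess` module reports return codes, where a negative value `-N` means the child was terminated by signal `N`. This thread does not confirm that DIGITS reports it this way. The helper name `describe_returncode` is made up for illustration.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess-style return code (hypothetical helper).

    Assumption: the code follows Python's subprocess convention,
    where a negative value -N means "terminated by signal N".
    """
    if returncode < 0:
        signum = -returncode
        try:
            # Map the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


print(describe_returncode(-9))
```

On Linux, signal 9 is SIGKILL, which is the signal the kernel sends when it kills a process for using too much memory. It can also come from other sources, so this is a hint rather than proof.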

ln3333 commented 6 years ago

I reproduced the error on my Docker box. By default, Docker allocates 2 GB of memory to the pod on my MacBook, which is insufficient in this case. According to the DIGITS dashboard, the training uses about 3 GB of memory.

In my case, increasing the memory in the Docker preferences panel works. Click the Docker whale icon, go to Preferences -> Advanced -> Memory, then increase the value accordingly.
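
To check whether the new limit took effect, something like the following could be run inside the container. It is a rough sketch that assumes a Linux container exposing `/proc/meminfo`, and it uses the approximate 3 GB figure from the DIGITS dashboard as the threshold. The function names `mem_total_gib` and `has_headroom` are made up for illustration.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Return total memory in GiB from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The line looks like "MemTotal:  2046748 kB"; the value is in KiB.
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemTotal not found in meminfo text")


def has_headroom(total_gib: float, needed_gib: float = 3.0) -> bool:
    """True if total memory covers the (assumed, approximate) training need."""
    return total_gib >= needed_gib


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            total = mem_total_gib(f.read())
        print(f"Total memory: {total:.2f} GiB, enough: {has_headroom(total)}")
    except OSError:
        # /proc/meminfo only exists on Linux (e.g. inside the container).
        print("Could not read /proc/meminfo on this system")
```

The 3 GB threshold is only what was observed for this training job. Other models or batch sizes will need a different value.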

humphd commented 6 years ago

@ln3333 thanks for this, I've added a note and pushed it. Closing.

I'm in the process of rewriting this for TensorFlow and TensorFlow.js right now in #14, so I think further debugging of DIGITS issues isn't necessary.