Aborted at xxxxxx (unix time) SIGSEGV (@0x0) received by PID xxxx (TID 0xxxxxxxx) from PID 0; stack trace

Hiroki11x commented 7 years ago

I tried various things, following this commit: https://github.com/slayton58/caffe2/commit/e415b74e439e67c6d5c2a6d1061c516ee3335afa

I want to run `caffe2/caffe2/python/examples/resnet50_trainer.py` with fp16 using P100.

Change

I edited `caffe2/caffe2/python/examples/resnet50_trainer.py` as follows:

add `output_type='float16'` to the `brew.image_input` arguments

I also changed `caffe2/caffe2/python/models/resnet.py` as follows:

import `pFP16Initializer` from `caffe2.python.modeling.initializers` and pass it in the `brew.conv` arguments

WeightInitializer=pFP16Initializer,
BiasInitializer=pFP16Initializer,

All changes are in this commit: https://github.com/rioyokotalab/models/commit/cc5f9a90a828fac4ad2b3eb403c42a4a24d42f6d
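
To make the shape of the change concrete, here is a minimal sketch of the keyword arguments described above. The import path is the one named in this issue; the helper name `fp16_conv_kwargs` and the constant `FP16_IMAGE_INPUT_KWARGS` are hypothetical and exist only for illustration. The import is guarded so the snippet still loads on a machine without Caffe2.

```python
# Sketch only: helper and constant names are hypothetical, not part of Caffe2.
try:
    from caffe2.python.modeling.initializers import pFP16Initializer
except ImportError:
    pFP16Initializer = None  # Caffe2 not installed; keeps the sketch loadable


def fp16_conv_kwargs(initializer=pFP16Initializer):
    """Return the two keyword arguments this issue adds to each brew.conv call."""
    return {
        "WeightInitializer": initializer,
        "BiasInitializer": initializer,
    }


# The issue also passes output_type='float16' to brew.image_input.
FP16_IMAGE_INPUT_KWARGS = {"output_type": "float16"}

print(sorted(fp16_conv_kwargs()))
print(FP16_IMAGE_INPUT_KWARGS)
```

In `resnet.py` these would presumably be merged into the existing `brew.conv(...)` call alongside its current arguments; the linked commit remains the authoritative version of the change.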

Execution

For intra-node parallel training on a machine with four P100s, I ran the following command:

python  /path-to-examples/resnet50_trainer.py  \
--train_data /path-to-ILSVRC2012-dataset/ilsvrc12_train_lmdb \
--num_gpus 4   \
--num_shards 1  \
--file_store_path . \
--image_size 224  \
--batch_size 128 \
--epoch_size 1281167   \
--num_epochs 1  \
--base_learning_rate 1.0  \
--weight_decay 0.0001 \
--num_labels=1000
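
As a sanity check on these arguments, the arithmetic below reproduces the iteration count that appears in the log (`91/10009`). It is a sketch that assumes the trainer treats `--batch_size` as the total across all GPUs and divides the epoch size by it; the function names are hypothetical.

```python
def iterations_per_epoch(epoch_size, total_batch_size, num_shards=1):
    """Whole iterations needed to cover one epoch (partial batch dropped)."""
    return epoch_size // (total_batch_size * num_shards)


def batch_per_gpu(total_batch_size, num_gpus):
    """Images each GPU processes per iteration, assuming an even split."""
    return total_batch_size // num_gpus


iters = iterations_per_epoch(epoch_size=1281167, total_batch_size=128, num_shards=1)
per_gpu = batch_per_gpu(total_batch_size=128, num_gpus=4)
print(iters, per_gpu)  # 10009 32
```

Under the same assumption each P100 sees 32 images per iteration, and the reported accuracy of 0.21875 equals 7/32, which may indicate that accuracy is logged for a single device.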

Error

INFO:resnet50_trainer:Finished iteration 91/10009 of epoch 0 (400.34 images/sec)
INFO:resnet50_trainer:Training loss: 2.21322536469, accuracy: 0.21875
*** Aborted at 1499852546 (unix time) try "date -d @1499852546" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x58) received by PID 102556 (TID 0x3aff01fff1d0) from PID 88; stack trace: ***
    @     0x3fffa05b0478 ([vdso]+0x477)
    @     0x3fff8dbbe268 (unknown)
    @     0x3fff8dd0ff50 (unknown)
    @     0x3fff8dc73a80 (unknown)
    @     0x3fff8dc7502c (unknown)
    @     0x3fff8dc753bc (unknown)
    @     0x3fff8db997a0 (unknown)
    @     0x3fff8da90ccc (unknown)
    @     0x3fff8dc14310 cuStreamSynchronize
    @     0x3fff9483d120 (unknown)
    @     0x3fff9488d808 cudaStreamSynchronize
    @     0x3fff96dbe440 caffe2::CUDAContext::FinishDeviceComputation()
    @     0x3fff96dbe8a0 caffe2::Operator<>::Run()
    @     0x3fff9678bff4 caffe2::DAGNet::RunAt()
    @     0x3fff96787c98 caffe2::DAGNetBase::WorkerFunction()
    @     0x3fff9678c2a4 std::thread::_Impl<>::_M_run()
    @     0x3fff79abbdd4 (unknown)
    @     0x3fffa0558728 start_thread
    @     0x3fffa034d210 __clone
Segmentation fault
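
The abort banner suggests `date -d @1499852546` for decoding the timestamp. Where GNU date is not available, the same conversion can be sketched in Python; the function name is hypothetical.

```python
from datetime import datetime, timezone


def abort_time_utc(unix_seconds):
    """Convert the unix time printed in the abort banner to a UTC datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


print(abort_time_utc(1499852546).isoformat())  # 2017-07-12T09:42:26+00:00
```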

Machine environment

name	description
OS	Red Hat Enterprise Linux Server release 7.3 (Maipo)
CPU	POWER8NVL revision : 1.0 (pvr 004c 0100) ×8
GCC Compiler	gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
GPU	Tesla P100 GPUs × 4
nvcc	release 8.0, V8.0.61
cuDNN	v6.0 (April 27, 2017), for CUDA 8.0

rioyokota commented 7 years ago

slayton58 is a former colleague, so if there is anything you don't understand, I can put you in direct contact.

Hiroki11x commented 7 years ago

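@rioyokota Really? I would like to ask whether ResNet50 training can be done in fp16 (and whether they have tried it).

I have a rough idea of what causes the error. What I would rather find out is whether it is the common understanding that training proceeds partway but performs poorly (the GPUs are not fully utilized, it is no faster than fp32, and operators such as SoftmaxWithLoss are not yet implemented or not yet released for fp16), so I would appreciate an introduction. (I know their GitHub account, so would it be okay to comment directly?)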

Hiroki11x commented 7 years ago

Multi-GPU training runs, but single-GPU training crashes:

INFO:resnet50_trainer:Training loss: 1.89116716385, accuracy: 0.1875
*** Aborted at 1502868249 (unix time) try "date -d @1502868249" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 72089 (TID 0x16000f9bf1b0) from PID 0; stack trace: ***
    @     0x100000050478 ([vdso]+0x477)
    @     0x10002455b908 (unknown)
    @     0x10002451c6ec (unknown)
    @     0x100024517178 (unknown)
    @     0x100012ebd98c (unknown)
    @     0x100012ce81d4 (unknown)
    @     0x100012d8a1ac (unknown)
    @     0x100012d8c4ec (unknown)
    @     0x100012e54498 (unknown)
    @     0x100012de4e88 (unknown)
    @     0x100012cf11ac (unknown)
    @     0x100012cf26f4 (unknown)
    @     0x100012be0b9c (unknown)
    @     0x100012be1378 (unknown)
    @     0x100012d83204 cuMemcpyAsync
    @     0x10000c48c72c (unknown)
    @     0x10000c461abc (unknown)
    @     0x10000c4a9f88 cudaMemcpyAsync
    @     0x10000a93ccc8 caffe2::CUDAContext::CopyBytes<>()
    @     0x10000a93e948 caffe2::Tensor<>::CopyFrom<>()
    @     0x10000a93f228 caffe2::ImageInputOp<>::CopyPrefetched()
    @     0x10000a93aa6c caffe2::PrefetchOperator<>::Run()
    @     0x100009f51924 caffe2::DAGNet::RunAt()
    @     0x100009f4cee8 caffe2::DAGNetBase::WorkerFunction()
    @     0x100009f51604 std::thread::_Impl<>::_M_run()
    @     0x10000051bdd4 (unknown)
    @     0x1000000b8728 start_thread
    @     0x10000034d210 __clone
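
Most frames in these traces are `(unknown)`, so it can help to filter them down to the resolved symbols when comparing crashes. The sketch below does that for the glog-style frame lines shown in this issue; the regular expression and function name are assumptions made for illustration, not part of any tool.

```python
import re

# Matches frame lines such as "    @     0x100012d83204 cuMemcpyAsync".
FRAME_RE = re.compile(r"^\s*@\s+(0x[0-9a-fA-F]+)\s+(.+?)\s*$")


def named_frames(trace_text):
    """Return the resolved symbol names from a stack trace, innermost first."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line)
        if match is None:
            continue  # banner lines, "PC:" line, blank lines
        symbol = match.group(2)
        if symbol.startswith("("):
            continue  # "(unknown)" or "([vdso]+0x477)"
        frames.append(symbol)
    return frames


SAMPLE = """\
    @     0x100012be1378 (unknown)
    @     0x100012d83204 cuMemcpyAsync
    @     0x10000c4a9f88 cudaMemcpyAsync
    @     0x10000a93f228 caffe2::ImageInputOp<>::CopyPrefetched()
"""
print(named_frames(SAMPLE))
```

Read this way, the multi-GPU trace reaches the driver through `caffe2::CUDAContext::FinishDeviceComputation()` and `cudaStreamSynchronize`, while the single-GPU trace reaches it through `caffe2::ImageInputOp<>::CopyPrefetched()` and `cudaMemcpyAsync`.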

Hiroki11x commented 7 years ago

When I run without nvprof, training succeeds. Does the crash depend on nvprof?

Hiroki11x commented 7 years ago

If I use the distributed stable version https://github.com/rioyokotalab/caffe2/tree/3a2e09674920fa9ac124a4facd6ef90e4eea1b47

this problem occurs.

If I use the version at the commit below,

commit c59f29163a15d0ccccb4a77db07f6f1da2757b76
Author: Yangqing Jia <Yangqing@users.noreply.github.com>
Date: Thu Aug 17 00:03:53 2017 -0700

Adios CNMEM. You will be remembered.

Summary:
As part of the cuda 9 move we have decided to deprecate the cnmem path
as it seems to be superceded by cub if one needs a memory pool.
Closes https://github.com/caffe2/caffe2/pull/1104

Differential Revision: D5647672

Pulled By: Yangqing

fbshipit-source-id: 988af5bf63e24efa1b631fd91ddb58e798ffc5c6

it is also unstable when run under nvprof.
