rioyokotalab / caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework.
https://caffe2.ai

fp16 training problem #23

Open Hiroki11x opened 7 years ago

Hiroki11x commented 7 years ago

I tried to train ResNet-50 in fp16 at commit 0f72d2508c5d5c295c1cd54aae1460a22ea994ea. The loss becomes nan and the run ends with an AssertionError:

INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 125/128 of epoch 0 (201.81 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 126/128 of epoch 0 (202.05 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 127/128 of epoch 0 (201.43 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
INFO:resnet50_trainer:Finished iteration 128/128 of epoch 0 (201.80 images/sec)
INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0
Traceback (most recent call last):
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 500, in <module>
    main()
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 496, in main
    Train(args)
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 421, in Train
    explog
  File "/home/hiroki11/latest_caffe2/caffe2/caffe2/python/examples/resnet50_trainer.py", line 188, in RunEpoch
    assert loss < 40, "Exploded gradients :("
AssertionError: Exploded gradients :(

This experiment was run on a 10-category dataset, so I used --num_labels 10.
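The assertion in the traceback, `assert loss < 40`, fails for a nan loss as well as for a large one, because every ordered comparison with nan is false. The sketch below is plain Python and independent of Caffe2; the helper names `passes_check` and `describe_loss` are hypothetical, and only the limit of 40 comes from the traceback. It shows how the two cases could be told apart when debugging:

```python
import math


def passes_check(loss, limit=40.0):
    """Same comparison as `assert loss < 40`; False for nan as well."""
    return loss < limit


def describe_loss(loss, limit=40.0):
    """Hypothetical helper: say why a loss value would trip the check."""
    if math.isnan(loss):
        return "nan"
    if math.isinf(loss):
        return "inf"
    if loss >= limit:
        return "too large"
    return "ok"


if __name__ == "__main__":
    for value in (7.55, 41.99, float("nan")):
        print(value, passes_check(value), describe_loss(value))
```

A nan loss is reported here as "nan" rather than as an exploded gradient, which may help narrow down whether the fp16 run overflowed or simply diverged.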

Hiroki11x commented 7 years ago

This does not depend on nvprof. I can get the .nvp profiling file, but the AssertionError is still raised.

Hiroki11x commented 7 years ago

fp16 training was implemented in https://github.com/caffe2/caffe2/commit/0f72d2508c5d5c295c1cd54aae1460a22ea994ea

Hiroki11x commented 7 years ago

On ReedBush (x86) it seems to work:

python /path-to/resnet50_trainer.py \
 --train_data /path-to/ilsvrc12_train_lmdb --gpus 0 \
 --batch_size 32 \
 --epoch_size 4096 \
 --dtype float16 \
 --num_epoch 1 \
 2>&1 | tee
./job.sh 
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 4096
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() ----- 
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0791149139404 secs
INFO:resnet50_trainer:Starting epoch 0/1
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Finished iteration 1/128 of epoch 0 (8.64 images/sec)
INFO:resnet50_trainer:Training loss: 7.5533246994, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/128 of epoch 0 (200.13 images/sec)
INFO:resnet50_trainer:Training loss: 41.9923553467, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 3/128 of epoch 0 (141.97 images/sec)
INFO:resnet50_trainer:Training loss: 17.8341007233, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 4/128 of epoch 0 (144.55 images/sec)
INFO:resnet50_trainer:Training loss: 22.7966308594, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 5/128 of epoch 0 (140.14 images/sec)
INFO:resnet50_trainer:Training loss: 20.7545146942, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 6/128 of epoch 0 (138.45 images/sec)
INFO:resnet50_trainer:Training loss: 18.2482261658, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 7/128 of epoch 0 (141.68 images/sec)
INFO:resnet50_trainer:Training loss: 13.1798582077, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 8/128 of epoch 0 (140.92 images/sec)
INFO:resnet50_trainer:Training loss: 10.557349205, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 9/128 of epoch 0 (135.75 images/sec)
INFO:resnet50_trainer:Training loss: 10.103843689, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 10/128 of epoch 0 (137.85 images/sec)
INFO:resnet50_trainer:Training loss: 9.42764949799, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 11/128 of epoch 0 (141.04 images/sec)
INFO:resnet50_trainer:Training loss: 8.75890922546, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 12/128 of epoch 0 (133.79 images/sec)
INFO:resnet50_trainer:Training loss: 9.25703716278, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 13/128 of epoch 0 (143.82 images/sec)
INFO:resnet50_trainer:Training loss: 9.45565891266, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 14/128 of epoch 0 (136.74 images/sec)
INFO:resnet50_trainer:Training loss: 8.72623538971, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 15/128 of epoch 0 (139.37 images/sec)
INFO:resnet50_trainer:Training loss: 8.18592834473, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 16/128 of epoch 0 (139.71 images/sec)
INFO:resnet50_trainer:Training loss: 8.34669685364, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 17/128 of epoch 0 (137.09 images/sec)
INFO:resnet50_trainer:Training loss: 8.39622783661, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 18/128 of epoch 0 (140.93 images/sec)
INFO:resnet50_trainer:Training loss: 9.08880615234, accuracy: 0.03125
INFO:resnet50_trainer:Finished iteration 19/128 of epoch 0 (132.99 images/sec)
INFO:resnet50_trainer:Training loss: 8.13797187805, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 20/128 of epoch 0 (140.68 images/sec)
INFO:resnet50_trainer:Training loss: 8.00732421875, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 21/128 of epoch 0 (147.09 images/sec)
INFO:resnet50_trainer:Training loss: 8.09602737427, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 22/128 of epoch 0 (135.89 images/sec)
INFO:resnet50_trainer:Training loss: 8.01212882996, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 23/128 of epoch 0 (138.55 images/sec)
INFO:resnet50_trainer:Training loss: 7.52052927017, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 24/128 of epoch 0 (140.79 images/sec)
INFO:resnet50_trainer:Training loss: 7.57909154892, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 25/128 of epoch 0 (138.27 images/sec)
INFO:resnet50_trainer:Training loss: 7.48513650894, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 26/128 of epoch 0 (137.35 images/sec)
INFO:resnet50_trainer:Training loss: 7.38407039642, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 27/128 of epoch 0 (124.33 images/sec)
INFO:resnet50_trainer:Training loss: 7.52470636368, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 28/128 of epoch 0 (158.89 images/sec)
INFO:resnet50_trainer:Training loss: 7.45266246796, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 29/128 of epoch 0 (140.77 images/sec)
INFO:resnet50_trainer:Training loss: 7.15103578568, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 30/128 of epoch 0 (135.27 images/sec)
INFO:resnet50_trainer:Training loss: 7.13074398041, accuracy: 0.03125
INFO:resnet50_trainer:Finished iteration 31/128 of epoch 0 (140.33 images/sec)
INFO:resnet50_trainer:Training loss: 6.9314789772, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 32/128 of epoch 0 (138.16 images/sec)
INFO:resnet50_trainer:Training loss: 6.98183584213, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 33/128 of epoch 0 (134.46 images/sec)
INFO:resnet50_trainer:Training loss: 7.18748807907, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 34/128 of epoch 0 (130.84 images/sec)
INFO:resnet50_trainer:Training loss: 7.13615083694, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 35/128 of epoch 0 (151.15 images/sec)
INFO:resnet50_trainer:Training loss: 7.19122552872, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 36/128 of epoch 0 (140.65 images/sec)
INFO:resnet50_trainer:Training loss: 7.01063156128, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 37/128 of epoch 0 (139.61 images/sec)
INFO:resnet50_trainer:Training loss: 7.13718032837, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 38/128 of epoch 0 (136.76 images/sec)
INFO:resnet50_trainer:Training loss: 6.93132448196, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 39/128 of epoch 0 (143.61 images/sec)
INFO:resnet50_trainer:Training loss: 7.16888713837, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 40/128 of epoch 0 (122.54 images/sec)
INFO:resnet50_trainer:Training loss: 7.03047513962, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 41/128 of epoch 0 (155.73 images/sec)
INFO:resnet50_trainer:Training loss: 7.0777554512, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 42/128 of epoch 0 (146.91 images/sec)
INFO:resnet50_trainer:Training loss: 7.09129714966, accuracy: 0.0
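To compare the failing run with the ReedBush run, the loss values can be pulled out of the trainer output and scanned for the first nan or over-limit value. This is a minimal sketch that assumes the `Training loss: ..., accuracy: ...` line format shown in the logs above; the function names are hypothetical and the limit of 40 is taken from the assertion in the traceback.

```python
import math
import re

# Matches lines such as
# "INFO:resnet50_trainer:Training loss: 7.5533246994, accuracy: 0.0"
LOSS_RE = re.compile(r"Training loss: (\S+), accuracy: (\S+)")


def parse_losses(lines):
    """Return a list of (loss, accuracy) tuples found in the log lines."""
    records = []
    for line in lines:
        match = LOSS_RE.search(line)
        if match:
            records.append((float(match.group(1)), float(match.group(2))))
    return records


def first_bad_index(records, limit=40.0):
    """Index of the first nan or over-limit loss, or None if all are fine."""
    for index, (loss, _accuracy) in enumerate(records):
        if math.isnan(loss) or loss >= limit:
            return index
    return None


if __name__ == "__main__":
    sample = [
        "INFO:resnet50_trainer:Training loss: 7.5533246994, accuracy: 0.0",
        "INFO:resnet50_trainer:Training loss: 41.9923553467, accuracy: 0.0",
        "INFO:resnet50_trainer:Training loss: nan, accuracy: 1.0",
    ]
    print(first_bad_index(parse_losses(sample)))
```

In the ReedBush log the second reported loss (41.99) is already above 40 and training still continues, which suggests the check is not applied on every iteration.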