Open Hiroki11x opened 7 years ago
It does not depend on nvprof.
I can get the .nvp profiling file; however, an AssertionError is output.
fp16 training was implemented in https://github.com/caffe2/caffe2/commit/0f72d2508c5d5c295c1cd54aae1460a22ea994ea
on ReedBush (x86)
It seems to work.
python /path-to/resnet50_trainer.py \
--train_data /path-to/ilsvrc12_train_lmdb --gpus 0 \
--batch_size 32 \
--epoch_size 4096 \
--dtype float16 \
--num_epoch 1 \
2>&1 | tee
./job.sh
INFO:resnet50_trainer:Running on GPUs: [0]
INFO:resnet50_trainer:Using epoch size: 4096
INFO:data_parallel_model:Parallelizing model for devices: [0]
INFO:data_parallel_model:Create input and model training operators
INFO:data_parallel_model:Model for GPU : 0
INFO:data_parallel_model:Adding gradient operators
INFO:data_parallel_model:Add gradient all-reduces for SyncSGD
INFO:data_parallel_model:Post-iteration operators for updating params
INFO:data_parallel_model:Calling optimizer builder function
INFO:data_parallel_model:Add initial parameter sync
WARNING:data_parallel_model:------- DEPRECATED API, please use data_parallel_model.OptimizeGradientMemory() -----
WARNING:memonger:NOTE: Executing memonger to optimize gradient memory
INFO:memonger:Memonger memory optimization took 0.0791149139404 secs
INFO:resnet50_trainer:Starting epoch 0/1
Ignoring @/caffe2/caffe2/contrib/nccl:nccl_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/contrib/gloo:gloo_ops_gpu as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:file_store_handler_ops as it is not a valid file.
Ignoring @/caffe2/caffe2/distributed:redis_store_handler_ops as it is not a valid file.
INFO:resnet50_trainer:Finished iteration 1/128 of epoch 0 (8.64 images/sec)
INFO:resnet50_trainer:Training loss: 7.5533246994, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 2/128 of epoch 0 (200.13 images/sec)
INFO:resnet50_trainer:Training loss: 41.9923553467, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 3/128 of epoch 0 (141.97 images/sec)
INFO:resnet50_trainer:Training loss: 17.8341007233, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 4/128 of epoch 0 (144.55 images/sec)
INFO:resnet50_trainer:Training loss: 22.7966308594, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 5/128 of epoch 0 (140.14 images/sec)
INFO:resnet50_trainer:Training loss: 20.7545146942, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 6/128 of epoch 0 (138.45 images/sec)
INFO:resnet50_trainer:Training loss: 18.2482261658, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 7/128 of epoch 0 (141.68 images/sec)
INFO:resnet50_trainer:Training loss: 13.1798582077, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 8/128 of epoch 0 (140.92 images/sec)
INFO:resnet50_trainer:Training loss: 10.557349205, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 9/128 of epoch 0 (135.75 images/sec)
INFO:resnet50_trainer:Training loss: 10.103843689, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 10/128 of epoch 0 (137.85 images/sec)
INFO:resnet50_trainer:Training loss: 9.42764949799, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 11/128 of epoch 0 (141.04 images/sec)
INFO:resnet50_trainer:Training loss: 8.75890922546, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 12/128 of epoch 0 (133.79 images/sec)
INFO:resnet50_trainer:Training loss: 9.25703716278, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 13/128 of epoch 0 (143.82 images/sec)
INFO:resnet50_trainer:Training loss: 9.45565891266, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 14/128 of epoch 0 (136.74 images/sec)
INFO:resnet50_trainer:Training loss: 8.72623538971, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 15/128 of epoch 0 (139.37 images/sec)
INFO:resnet50_trainer:Training loss: 8.18592834473, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 16/128 of epoch 0 (139.71 images/sec)
INFO:resnet50_trainer:Training loss: 8.34669685364, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 17/128 of epoch 0 (137.09 images/sec)
INFO:resnet50_trainer:Training loss: 8.39622783661, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 18/128 of epoch 0 (140.93 images/sec)
INFO:resnet50_trainer:Training loss: 9.08880615234, accuracy: 0.03125
INFO:resnet50_trainer:Finished iteration 19/128 of epoch 0 (132.99 images/sec)
INFO:resnet50_trainer:Training loss: 8.13797187805, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 20/128 of epoch 0 (140.68 images/sec)
INFO:resnet50_trainer:Training loss: 8.00732421875, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 21/128 of epoch 0 (147.09 images/sec)
INFO:resnet50_trainer:Training loss: 8.09602737427, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 22/128 of epoch 0 (135.89 images/sec)
INFO:resnet50_trainer:Training loss: 8.01212882996, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 23/128 of epoch 0 (138.55 images/sec)
INFO:resnet50_trainer:Training loss: 7.52052927017, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 24/128 of epoch 0 (140.79 images/sec)
INFO:resnet50_trainer:Training loss: 7.57909154892, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 25/128 of epoch 0 (138.27 images/sec)
INFO:resnet50_trainer:Training loss: 7.48513650894, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 26/128 of epoch 0 (137.35 images/sec)
INFO:resnet50_trainer:Training loss: 7.38407039642, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 27/128 of epoch 0 (124.33 images/sec)
INFO:resnet50_trainer:Training loss: 7.52470636368, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 28/128 of epoch 0 (158.89 images/sec)
INFO:resnet50_trainer:Training loss: 7.45266246796, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 29/128 of epoch 0 (140.77 images/sec)
INFO:resnet50_trainer:Training loss: 7.15103578568, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 30/128 of epoch 0 (135.27 images/sec)
INFO:resnet50_trainer:Training loss: 7.13074398041, accuracy: 0.03125
INFO:resnet50_trainer:Finished iteration 31/128 of epoch 0 (140.33 images/sec)
INFO:resnet50_trainer:Training loss: 6.9314789772, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 32/128 of epoch 0 (138.16 images/sec)
INFO:resnet50_trainer:Training loss: 6.98183584213, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 33/128 of epoch 0 (134.46 images/sec)
INFO:resnet50_trainer:Training loss: 7.18748807907, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 34/128 of epoch 0 (130.84 images/sec)
INFO:resnet50_trainer:Training loss: 7.13615083694, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 35/128 of epoch 0 (151.15 images/sec)
INFO:resnet50_trainer:Training loss: 7.19122552872, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 36/128 of epoch 0 (140.65 images/sec)
INFO:resnet50_trainer:Training loss: 7.01063156128, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 37/128 of epoch 0 (139.61 images/sec)
INFO:resnet50_trainer:Training loss: 7.13718032837, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 38/128 of epoch 0 (136.76 images/sec)
INFO:resnet50_trainer:Training loss: 6.93132448196, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 39/128 of epoch 0 (143.61 images/sec)
INFO:resnet50_trainer:Training loss: 7.16888713837, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 40/128 of epoch 0 (122.54 images/sec)
INFO:resnet50_trainer:Training loss: 7.03047513962, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 41/128 of epoch 0 (155.73 images/sec)
INFO:resnet50_trainer:Training loss: 7.0777554512, accuracy: 0.0
INFO:resnet50_trainer:Finished iteration 42/128 of epoch 0 (146.91 images/sec)
INFO:resnet50_trainer:Training loss: 7.09129714966, accuracy: 0.0
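To compare runs like the one above without reading every line, the log can be summarized with a small script. This is only a sketch: it assumes the two `resnet50_trainer` line formats are exactly as shown in the log, and the names `summarize` and `skip_warmup` are made up for illustration, not part of Caffe2. The first iteration is skipped by default because its throughput (8.64 images/sec) appears to include startup cost.

```python
import re

# Patterns for the two line types seen in the log above (assumed format).
ITER_RE = re.compile(
    r"Finished iteration (\d+)/(\d+) of epoch (\d+) \(([\d.]+) images/sec\)"
)
LOSS_RE = re.compile(r"Training loss: ([\d.eE+-]+), accuracy: ([\d.eE+-]+)")


def summarize(lines, skip_warmup=1):
    """Collect throughput and loss values from resnet50_trainer log lines.

    skip_warmup (hypothetical parameter): number of leading iterations
    excluded from the throughput mean.
    """
    rates, losses = [], []
    for line in lines:
        m = ITER_RE.search(line)
        if m:
            rates.append(float(m.group(4)))
            continue
        m = LOSS_RE.search(line)
        if m:
            losses.append(float(m.group(1)))
    steady = rates[skip_warmup:]
    return {
        "iterations": len(rates),
        "mean_images_per_sec": sum(steady) / len(steady) if steady else None,
        "max_loss": max(losses) if losses else None,
        "last_loss": losses[-1] if losses else None,
    }


# First three iterations copied from the log above.
sample = [
    "INFO:resnet50_trainer:Finished iteration 1/128 of epoch 0 (8.64 images/sec)",
    "INFO:resnet50_trainer:Training loss: 7.5533246994, accuracy: 0.0",
    "INFO:resnet50_trainer:Finished iteration 2/128 of epoch 0 (200.13 images/sec)",
    "INFO:resnet50_trainer:Training loss: 41.9923553467, accuracy: 0.0",
    "INFO:resnet50_trainer:Finished iteration 3/128 of epoch 0 (141.97 images/sec)",
    "INFO:resnet50_trainer:Training loss: 17.8341007233, accuracy: 0.0",
]
print(summarize(sample))
```

For a full run, pass the lines of the file written by `tee` instead of `sample`. The `max_loss` field makes the early spike (41.99 at iteration 2) easy to spot when comparing fp16 and fp32 runs.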
I tried to train resnet50 in fp16 at commit 0f72d2508c5d5c295c1cd54aae1460a22ea994ea
This experiment was executed with a 10-category dataset, so I used
--num_labels 10