karpathy / neuraltalk2

Efficient Image Captioning code in Torch, runs on GPU

Segmentation fault on Jetson TX1 during training #160

Open ZahlGraf opened 7 years ago

ZahlGraf commented 7 years ago

Hi,

Since I could not get the pretrained models running, I tried to train my own model on the Jetson TX1. Unfortunately, training stops with a segmentation fault:

th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -cnn_proto cnn_model/VGG_ILSVRC_16_layers_deploy.prototxt -cnn_model cnn_model/VGG_ILSVRC_16_layers.caffemodel -max_iters 1 -batch_size 1 -language_eval 1

DataLoader loading json file:   coco/cocotalk.json  
vocab size is 9567  
DataLoader loading h5 file:     coco/cocotalk.h5    
read 123287 images of size 3x256x256    
max sequence length in data is 16   
assigned 113287 images to split train   
assigned 5000 images to split val   
assigned 5000 images to split test  
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:537] Reading dangerously large protocol message.  If the message turns out to be larger than 1073741824 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:78] The total number of bytes read was 553432081
Successfully loaded cnn_model/VGG_ILSVRC_16_layers.caffemodel
conv1_1: 64 3 3 3
conv1_2: 64 64 3 3
conv2_1: 128 64 3 3
conv2_2: 128 128 3 3
conv3_1: 256 128 3 3
conv3_2: 256 256 3 3
conv3_3: 256 256 3 3
conv4_1: 512 256 3 3
conv4_2: 512 512 3 3
conv4_3: 512 512 3 3
conv5_1: 512 512 3 3
conv5_2: 512 512 3 3
conv5_3: 512 512 3 3
fc6: 1 1 25088 4096
fc7: 1 1 4096 4096
fc8: 1 1 4096 1000
converting first layer conv filters from BGR to RGB...  
Segmentation fault

Sometimes the segmentation fault happens earlier (right after the CNN model is loaded). I have the feeling that this is an out-of-memory issue, since shortly before the fault appears, RAM usage is above 90%. However, Torch normally starts using the swap file when it runs out of memory, and that does not happen here...
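
To check the out-of-memory suspicion, I would log free memory right before the conversion step. A rough sketch of what I have in mind (assuming cutorch's getMemoryUsage API and plain Lua collectgarbage; this helper is hypothetical and not part of train.lua):

-- hypothetical helper: log Lua heap and GPU memory at a given point
require 'cutorch'  -- only needed when running with the GPU backend

local function report_memory(tag)
  local lua_mb = collectgarbage('count') / 1024  -- Lua heap in MB ('count' returns KB)
  local free_b, total_b = cutorch.getMemoryUsage(cutorch.getDevice())  -- bytes on current GPU
  print(string.format('[%s] lua heap %.1f MB | gpu free %.1f / %.1f MB',
    tag, lua_mb, free_b / 1024^2, total_b / 1024^2))
end

report_memory('before BGR->RGB conversion')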

I also tried training on the CPU (in case GPU memory cannot be swapped), but that does not help either.
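
For the CPU run I assume the -gpuid flag from the README (where -1 selects CPU mode) is the right way to disable the GPU, i.e. the same command as above plus that switch:

th train.lua -input_h5 coco/cocotalk.h5 -input_json coco/cocotalk.json -cnn_proto cnn_model/VGG_ILSVRC_16_layers_deploy.prototxt -cnn_model cnn_model/VGG_ILSVRC_16_layers.caffemodel -max_iters 1 -batch_size 1 -gpuid -1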

Has anyone tried training this on a Jetson TX1?

kaisark commented 6 years ago

I wouldn't recommend training on edge devices like the TX1. Generally speaking: train in the cloud, run on the edge (if possible)...