tsinghua-rll / VoxelNet-tensorflow

A 3D object detection system for autonomous driving.
MIT License
453 stars 123 forks

It will stop in training, there is no error in terminal. #11

Closed szm88 closed 6 years ago

szm88 commented 6 years ago

hi jeasinema, I got your code and did the following steps:

1. Changed the GPU config: '3,1,2,0' -> '0' (I only have a GTX 1070).
2. `python3 setup.py build_ext --inplace` (I use Python 3.5.2).
3. `cd utils` and `python3 preprocess.py`. Error: there is no config there, so I copied tf_voxelnet/config.py to tf_voxelnet/utils/ (is that right?). The data is from the KITTI object dataset: data_object_velodyne.zip (about 29 GB), image_2 (12 GB), label_2, and voxel (about 25 GB).
4. `python3 train.py`. Error: there are no labels for the testing data, so I used "training" in place of "testing" in train.py. Then:

```
train: 18/60 @ epoch:3/10 loss: 1.9014711380004883 reg_loss: 0.31266674399375916 cls_loss: 1.5888043642044067 default
train ['000004']
--------------------using time: 73.70951771736145s-------------------
train: 19/60 @ epoch:3/10 loss: 1.5401957035064697 reg_loss: 0.23529152572155 cls_loss: 1.3049042224884033 default
train ['000001']
--------------------using time: 77.3743188381195s-------------------
train: 20/60 @ epoch:3/10 loss: 1.8793950080871582 reg_loss: 0.2751219868659973 cls_loss: 1.6042730808258057 default
```

It stops at this point; there is no error in the terminal.


When I trained the model on 4 Titan X GPUs, it used all 20 CPU threads and 45 GB of RAM, but only 149 MB of GPU memory per process (x8). Why does it use so much CPU? I also found gpu-util at 0%, 0%, 0%, 50%, and I was training another model at the same time, so I think this one didn't actually use the GPU. What's the reason?
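For what it's worth, the GPU-config change in step 1 can be reproduced without editing config.py: a minimal sketch (using only the standard CUDA_VISIBLE_DEVICES mechanism, not the repo's actual code) of restricting the process to one GPU, plus a note on checking where ops are actually placed:

```python
import os

# Restrict the process to GPU 0 only. This must be set BEFORE TensorFlow
# is imported, and is what the config change '3,1,2,0' -> '0' effectively
# does: only device 0 is visible to CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# If gpu-util stays at 0%, the graph may be running on the CPU. In TF 1.x,
# log_device_placement=True prints which device each op landed on
# (illustrative only, not run here):
#
#   import tensorflow as tf
#   sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

print(os.environ["CUDA_VISIBLE_DEVICES"])  # -> 0
```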

turboxin commented 6 years ago

Hi szm88, could you please share how you solved this problem? I'm running into much the same issue:

```
train: 20/18700 @ epoch:0/10 loss: 4.318506240844727 reg_loss: 2.653141498565674 cls_loss: 1.6653645038604736 default
```

It just stops here with no error in the terminal, and gpu-util drops to 0% while GPU memory usage stays high (8527/11172 MB on each of four 1080 Ti cards).

qianguih commented 6 years ago

I ran into the same problem. Any suggestion or comment would be appreciated. :)

dominikj93 commented 6 years ago

As far as I know, the labels for the testing set are not publicly available, so you cannot use the training/testing split as provided by the KITTI dataset. The solution is to split the training set into smaller training and validation sets; at least that's what worked for me. Hope it helps!
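The split described above can be sketched as follows. This is a hypothetical helper (the function name and seed are mine, not from the repo), assuming the standard KITTI object training set of 7481 samples with zero-padded six-digit IDs:

```python
import random

def split_train_val(num_samples, val_ratio=0.2, seed=0):
    """Split KITTI-style sample IDs ('000000', '000001', ...) into
    disjoint train/val lists. Hypothetical sketch, not the repo's code."""
    ids = ["%06d" % i for i in range(num_samples)]
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(ids)
    n_val = int(num_samples * val_ratio)
    return sorted(ids[n_val:]), sorted(ids[:n_val])

# KITTI object training set has 7481 labelled samples.
train_ids, val_ids = split_train_val(7481)
print(len(train_ids), len(val_ids))  # -> 5985 1496
```

Writing each list to a text file (one ID per line) gives split files of the kind referenced later in this thread.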

qianguih commented 6 years ago

@dominikj93 Thanks for your reply. Actually, I have already split the training data. However, it still crashes sometimes during training.

jeasinema commented 6 years ago

@qianguih please upload the terminal output from when you run this program. It's hard for us to determine what went wrong with this limited information.

BTW, @dominikj93 does give the correct solution; sorry for not mentioning that I use a split file available here.

qianguih commented 6 years ago

@jeasinema Thanks for your reply. I did use the same split file in my experiments. Training runs smoothly for a couple of epochs and then just stops, without reporting any errors or warnings. It works fine most of the time on a 1080 Ti GPU but fails frequently on a P100 GPU. I don't have a sample output right now; I am trying to reproduce the problem and will post one here when it is available. Currently, I suspect something is wrong with the multi-threaded processing in the data loader.

jeasinema commented 6 years ago

@qianguih you may have found a real problem: the loader threads may be competing with the model for resources. You can try adding more workers, like here.
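For readers following along, the kind of multi-worker, queue-fed loader being discussed can be sketched roughly like this (all names are hypothetical, not the repo's actual code): worker threads pre-load samples into a bounded queue that the training loop drains, and `num_workers` controls how many loaders run in parallel.

```python
import queue
import threading

def start_loaders(sample_ids, load_fn, num_workers=4, maxsize=8):
    """Spawn num_workers threads that pre-load samples into a bounded
    output queue. The bound keeps memory in check; more workers can help
    when one loader thread cannot keep the GPU fed. Hypothetical sketch."""
    work = queue.Queue()
    for sid in sample_ids:
        work.put(sid)
    out = queue.Queue(maxsize=maxsize)

    def worker():
        while True:
            try:
                sid = work.get_nowait()
            except queue.Empty:
                return  # no more samples to load
            out.put(load_fn(sid))  # blocks when the output queue is full

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    return out, threads

# Usage sketch (load_sample is a hypothetical per-sample loader):
# batches, _ = start_loaders(train_ids, load_sample, num_workers=8)
# batch = batches.get()
```

Note that a bounded queue like this can still deadlock if the consumer stops draining it while workers block on `put`, which matches the stall-without-error symptom described in this thread.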

ashishkumar-rambhatla commented 6 years ago

@jeasinema Can you share with us the code for splitting the KITTI training data using the split files provided?

qianguih commented 6 years ago

@jeasinema Attached is a sample log. The CPU thread is still running but the GPU thread is dead; there is no error or warning. I have tried 8 workers, but it didn't help.

log.txt

jeasinema commented 6 years ago

@qianguih Have you tried pausing the training and then restarting it? Did it still get stuck at the same point?

qianguih commented 6 years ago

@jeasinema No, I didn't try that. Instead, I replaced the multi-threaded data loader with a plain single-threaded loader, and that solved the problem, which confirms that the problem does come from the multi-threaded data loader.
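A plain single-threaded replacement of the sort described might look like this (hypothetical sketch, not the poster's actual code): a generator that loads one sample at a time, with no worker threads or shared queues to contend over.

```python
def sequential_loader(sample_ids, load_fn):
    """Single-threaded data loader: yield one pre-processed sample at a
    time. Slower than a multi-worker loader, but it removes the thread
    contention/deadlock suspected above. Hypothetical sketch."""
    for sid in sample_ids:
        yield load_fn(sid)

# Usage sketch (load_sample, make_feed, train_op are hypothetical):
# for sample in sequential_loader(train_ids, load_sample):
#     sess.run(train_op, feed_dict=make_feed(sample))
```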

zhanpx commented 6 years ago

> @jeasinema No, I didn't try this. I tried to replace the multi-thread data loader with a normal loader. And it solved the problem, which proved that the problem does come from the multi-thread data loader.

I ran across the same problem. Could you share your loader? Thanks!