tsinghua-rll / VoxelNet-tensorflow

A 3D object detection system for autonomous driving.
MIT License
453 stars 123 forks

training stops at 20/18700 of epoch 0/10 with no error in terminal #21

Closed: turboxin closed this issue 6 years ago

turboxin commented 6 years ago

Hi jeasinema, thank you for this great work!

When I run train.py, it stops here and makes no further progress, with no error in the terminal:

train: 20/18700 @ epoch:0/10 loss: 4.318506240844727 reg_loss: 2.653141498565674 cls_loss: 1.6653645038604736 default

GPU utilization drops to 0%, while GPU memory usage stays high at 8527/11172 MB on each of the four 1080 Ti cards.

Any help is appreciated. Thanks in advance!
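
One way to see where a silently stalled run is stuck is to have Python's standard-library `faulthandler` module dump every thread's stack at a fixed interval. The sketch below assumes the stall is visible from Python frames (it may instead sit inside a native op, in which case only the calling Python line shows up). The helper names `enable_hang_dump` and `disable_hang_dump` and the 60-second interval are illustrative and not part of this repository.

```python
import faulthandler
import sys


def enable_hang_dump(interval_s=60.0, stream=sys.stderr):
    """Periodically dump all thread stacks so a silent stall leaves a trace.

    Hypothetical helper, not part of VoxelNet-tensorflow.
    `stream` must be a real file object (it needs a file descriptor).
    """
    # Also dump tracebacks on fatal signals such as SIGSEGV.
    faulthandler.enable(file=stream, all_threads=True)
    # Every `interval_s` seconds, write the stack of every thread to `stream`.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)


def disable_hang_dump():
    """Cancel the periodic dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `enable_hang_dump()` near the top of train.py, before the training loop starts, would print the stacks to stderr once a minute. If the main thread sits on the same line in every dump after the progress output stops, that call is a likely place to look.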

turboxin commented 6 years ago

It seems to have something to do with summary_image: valid_loader.load() won't work. Any idea how to fix this?

                if is_summary_image:
                    ret = model.predict_step(
                            sess, valid_loader.load(), summary=True)
                    summary_writer.add_summary(ret[-1], iter)
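
To check whether the `valid_loader.load()` call itself is what blocks, it could be run with a timeout. This is only a debugging sketch: `call_with_timeout` is a hypothetical helper, not something the repository provides, and it makes no assumption about how the loader works internally. It detects a stall but does not fix one.

```python
import concurrent.futures


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread; raise TimeoutError if it stalls.

    Hypothetical debugging helper. The worker thread is not killed on timeout,
    so a call that never returns will still keep the process alive at exit.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(
            "{} did not return within {} s".format(
                getattr(fn, "__name__", repr(fn)), timeout_s))
    finally:
        # Do not wait for a stuck worker, or this call would hang as well.
        executor.shutdown(wait=False)
```

For example, `call_with_timeout(valid_loader.load, 30.0)` in place of the direct `valid_loader.load()` call would turn an indefinite stall into a visible `TimeoutError`, assuming `load()` takes no arguments as in the snippet above. The 30-second limit is an arbitrary choice.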

HectorAnadon commented 6 years ago

It also happens to me, at epoch 60, here:

                    if is_validate:
                        ret = model.validate_step(
                                sess, valid_loader.load(), summary=True)
                        summary_writer.add_summary(ret[-1], iter)

qianguih commented 6 years ago

Same problem here. Any suggestions or comments would be appreciated. :)

dominikj93 commented 6 years ago

I have answered almost the same question in issue #11