hkust-vgd / pointwise

Code for Pointwise Convolutional Neural Networks, CVPR 2018
http://pointwise.scenenn.net
MIT License

CUDA_TIME_OUT error in the middle of training #9

Closed: Zakieh closed this issue 5 years ago

Zakieh commented 5 years ago

Dear authors,

Thank you very much for providing the code. I'm trying to train the semantic segmentation network on the S3DIS dataset, but in the middle of training, before even one epoch ends, this error appears:

2018-11-06 11:32:27.452152: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:706] failed to record completion event; therefore, failed to create inter-stream dependency
2018-11-06 11:32:27.452227: E tensorflow/stream_executor/stream.cc:325] Error recording event in stream: error recording CUDA event on stream 0x7f51f5847720: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2018-11-06 11:32:27.452268: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_TIMEOUT: the launch timed out and was terminated
2018-11-06 11:32:27.452292: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1

Is it related to the batch size? I can't think of anything else that could cause it. Thank you in advance for your help.

songuke commented 5 years ago

I guess the kernel reaches the time limit set by the operating system. You might want to use a faster GPU or remove the time limit. You can also reduce the batch size to see if the problem goes away.
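
Before changing the batch size or the time limit, it can help to confirm that the crash really is a launch timeout and not some other CUDA failure. The snippet below is a minimal sketch that scans a saved training log for `CUDA_ERROR_LAUNCH_TIMEOUT` entries. It assumes the TensorFlow log line format shown in the report above; the function name `find_launch_timeouts` and the regular expression are illustrative and not part of this repository or of TensorFlow.

```python
import re

# Assumed log line shape, taken from the error report in this thread:
#   2018-11-06 11:32:27.452227: E tensorflow/.../stream.cc:325] message
LOG_LINE = re.compile(
    r"(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+?):(?P<line>\d+)\] (?P<message>.*)"
)


def find_launch_timeouts(log_text):
    """Return (time, source file, source line) for each launch-timeout entry.

    Hypothetical helper: lines that do not match the assumed format,
    or that report a different error, are skipped.
    """
    hits = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and "CUDA_ERROR_LAUNCH_TIMEOUT" in match.group("message"):
            hits.append(
                (match.group("time"), match.group("source"), int(match.group("line")))
            )
    return hits
```

Calling `find_launch_timeouts(open("train.log").read())` on a captured log (the file name here is only an example) returns an empty list when no launch timeout was recorded, which would suggest looking for a different cause.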

Zakieh commented 5 years ago

Thank you, removing the time limit resolved the issue. One other question, though: is there any way you could provide the pre-trained network models for scene segmentation? I'm trying to train it on the S3DIS data, and it would take a lifetime on my GPU!

songuke commented 5 years ago

Sorry for the late reply. I should have the pretrained model somewhere on my PC. If you really need it, I can try to find it. I'm not sure it is the one that was used to produce the numbers in the paper, so I think it is still preferable to train again.