When "train.py" is running, GPU usage climbs to 100%, then the screen goes black and the program crashes (#2)

Allan-Xu commented 5 years ago

Dear author

Here are my hardware specs:
GPU: GeForce GTX 1080 Ti 16GB
RAM: 32GB

Here are my software specs:
Python 3.6.8
TensorFlow-GPU 1.13.1
CUDA 10.0
cuDNN 7.6
NVIDIA drivers 430.2

I ran into a problem with the project. When train.py is running, GPU usage climbs to 100%, then the screen goes black and the program crashes. I changed the input to just two point cloud files, and the program still crashed. I do not know what went wrong. Here is the output of the run:

(pcc) D:\xu\pcc\pcc_geo_cnn-master\src>python train.py "../data/ModelNet10_pc_64/**/*.ply" ../models/ModelNet10_pc_64_000001 --resolution 64 --lmbda 0.000001
2019-07-25 20:45:48.761 INFO pc_io - load_points: Loading PCs into memory (parallel reading)
100%|██████████| 2/2 [00:14<00:00, 7.34s/it]
2019-07-25 20:46:03.677 INFO estimator - __init__: Using config: {'_model_dir': '../models/ModelNet10_pc_64_000001', '_tf_random_seed': 42, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true graph_options { rewrite_options { meta_optimizer_iterations: ONE } } , '_keep_checkpoint_max': 50, '_keep_checkpoint_every_n_hours': 1, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000023C7ABFFBE0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2019-07-25 20:46:03.692 INFO estimator_training - should_run_distribute_coordinator: Not using Distribute Coordinator.
2019-07-25 20:46:03.692 INFO training - run: Running training and evaluation locally (non-distributed).
2019-07-25 20:46:03.692 INFO training - run_local: Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
2019-07-25 20:46:03.708 WARNING deprecation - new_func: From C:\Users\2016\Anaconda3\envs\pcc\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-07-25 20:46:03.723 WARNING deprecation - new_func: From C:\Users\2016\Anaconda3\envs\pcc\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use tf.py_function, which takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.
2019-07-25 20:46:03.848 INFO estimator - _call_model_fn: Calling model_fn.
2019-07-25 20:46:05.817 INFO estimator - _call_model_fn: Done calling model_fn.
2019-07-25 20:46:05.817 INFO basic_session_run_hooks - init: Create CheckpointSaverHook.
2019-07-25 20:46:08.941 INFO monitored_session - finalize: Graph was finalized.
2019-07-25 20:46:08.945498: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-07-25 20:46:09.201974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:04:00.0
totalMemory: 11.00GiB freeMemory
2019-07-25 20:46:09.208053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-25 20:46:09.862970
2019-07-25 20:46:09.864114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-07-25 20:46:09.866446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-07-25 20:46:09.868689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8786 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
2019-07-25 20:46:09.879 WARNING deprecation - new_func: From C:\Users\2016\Anaconda3\envs\pcc\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2019-07-25 20:46:09.879 INFO saver - restore: Restoring parameters from ../models/ModelNet10_pc_64_000001\model.ckpt-0
2019-07-25 20:46:10.410 WARNING deprecation - new_func: From C:\Users\2016\Anaconda3\envs\pcc\lib\site-packages\tensorflow\python\training\saver.py:1070: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
2019-07-25 20:46:10.503 INFO session_manager - _try_run_local_init_op: Running local_init_op.
2019-07-25 20:46:10.535 INFO session_manager - _try_run_local_init_op: Done running local_init_op.
2019-07-25 20:46:15.081 INFO basic_session_run_hooks - _save: Saving checkpoints for 0 into ../models/ModelNet10_pc_64_000001\model.ckpt.
2019-07-25 20:46:17.340142: W .\tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor node add_5. Error: Node ArithmeticOptimizer/HoistCommonFactor_Add_add_4 is missing output properties at position :0 (num_outputs=0)
2019-07-25 20:46:17.346208: W .\tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor node ArithmeticOptimizer/AddOpsRewrite_add_6. Error: Node ArithmeticOptimizer/HoistCommonFactor_Add_add_4 is missing output properties at position :0 (num_outputs=0)
2019-07-25 20:46:17.660019: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
201
2019-07-25 20:46:31.033337: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] Internal: error destroying CUDA event in context 0000023C7AF4DA50: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-07-25 20:46:31.040712: E tensorflow/stream_executor/cuda/cuda_timer.cc:60] Internal: error destroying CUDA event in context 0000023C7AF4DA50: CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure
2019-07-25 20:46:31.047944: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.

Thanks for your reply!
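
To narrow down a log like the one above, it can help to keep only the error and fatal entries. The sketch below assumes the TensorFlow C++ log format seen in this output, where a single letter after the timestamp gives the severity (I = info, W = warning, E = error, F = fatal). The function name `severe_entries` and the choice of sample lines are illustrative and not part of the repository.

```python
import re

# Matches TensorFlow C++ log entries such as
# "2019-07-25 20:46:31.033337: E tensorflow/...cc:55] message".
# Assumption: timestamp, colon, one severity letter (I/W/E/F), "file:line]", message.
ENTRY = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<level>[IWEF]) (?P<source>\S+)\] (?P<message>.*)$"
)


def severe_entries(lines, levels=("E", "F")):
    """Return (level, source, message) tuples for entries at the given severities."""
    found = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and match.group("level") in levels:
            found.append(
                (match.group("level"), match.group("source"), match.group("message"))
            )
    return found


# Three entries copied from the log in this thread.
sample = [
    "2019-07-25 20:46:17.660019: I tensorflow/stream_executor/dso_loader.cc:152] "
    "successfully opened CUDA library cublas64_100.dll locally",
    "2019-07-25 20:46:31.033337: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] "
    "Internal: error destroying CUDA event in context 0000023C7AF4DA50: "
    "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure",
    "2019-07-25 20:46:31.047944: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] "
    "Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.",
]

for level, source, message in severe_entries(sample):
    print(level, source, message)
```

The Python-side `INFO` and `WARNING` lines use a different layout and are skipped by this pattern, so only the CUDA and cuDNN failures remain.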

mauriceqch commented 5 years ago

Hello Allan,

It shouldn't make a difference, but could you try setting the batch size to 1 (--batch_size 1)?
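
The suggestion above amounts to rerunning the original command with one extra flag. Here is a minimal sketch that assembles that command as an argument list. `build_train_command` is a hypothetical helper, and it uses only the arguments that appear in this thread (`--resolution`, `--lmbda`, `--batch_size`).

```python
import sys


def build_train_command(data_glob, model_dir, resolution, lmbda, batch_size=None):
    """Assemble the argument list for train.py (hypothetical helper).

    lmbda is passed as a string so it reaches train.py exactly as typed.
    """
    command = [
        sys.executable, "train.py", data_glob, model_dir,
        "--resolution", str(resolution),
        "--lmbda", lmbda,
    ]
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    return command


command = build_train_command(
    "../data/ModelNet10_pc_64/**/*.ply",
    "../models/ModelNet10_pc_64_000001",
    resolution=64,
    lmbda="0.000001",
    batch_size=1,
)
print(" ".join(command[1:]))
# To launch it from the src directory: subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` without a shell leaves the glob pattern unexpanded, which matches the quoted pattern in the original command.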

Alternatively, could you please try it on Ubuntu? The error seems to be an out-of-memory error, and it may be due to the OS difference. In my experience, deep learning libraries tend to have better support on Linux.
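
One rough way to sanity-check a memory hypothesis is to estimate how tensor size scales with batch size. The sketch below assumes the point clouds are voxelized into dense float32 grids at the `--resolution` given on the command line. The batch sizes and the channel count are hypothetical values chosen for illustration, not the model's actual settings.

```python
def dense_grid_bytes(resolution, batch_size=1, channels=1, bytes_per_value=4):
    """Bytes needed for a dense batch of cubic voxel grids (float32 by default)."""
    return batch_size * channels * resolution ** 3 * bytes_per_value


MIB = 1024 ** 2

# One 64^3 single-channel float32 grid: 64 * 64 * 64 * 4 bytes.
print(dense_grid_bytes(64) / MIB, "MiB")

# Hypothetical example: a 32-channel feature map kept at full resolution.
for batch_size in (1, 8, 32):
    size = dense_grid_bytes(64, batch_size=batch_size, channels=32)
    print(batch_size, size / MIB, "MiB")
```

Under these assumptions the size of each tensor grows linearly with the batch size, so running with `--batch_size 1` is a cheap first test before comparing against the 8786 MB that the log reports as available to TensorFlow.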

Allan-Xu commented 5 years ago

Dear author, thanks for your reply! I will try running the project on Ubuntu. Thank you so much!

