Hi @jiaruHithub. It seems the problem happens because of an out-of-memory error. Could you try using a GPU with more memory, or reduce the batch size?
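(Editor's note: for readers hitting the same out-of-memory error, below is a minimal TF 1.x sketch of the two usual workarounds. It assumes the training script builds its own `tf.Session` and accepts a `--batch_size` flag, as the `Namespace` output further down suggests; it is illustrative, not the repo's actual code.)

```python
import tensorflow as tf  # TF 1.x, matching the log in this thread

# Option 1: pass a smaller batch to the training script, e.g.
#   python train.py --batch_size 4
# instead of the default batch_size=8 shown in the Namespace below.

# Option 2: let TensorFlow allocate GPU memory on demand instead of
# reserving it all up front (helps when the model barely fits).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```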
Thank you for the answer from the author himself. By the way, I have a few questions:
- Tomorrow I will use 4 Tesla V100s to train the code. How should I change the code so that it trains on 4 GPUs in parallel?
- In the TensorFlow version, what is the role of the dataset `Stanford3dDataset_v1.2_Aligned_Version.zip`? The PyTorch version does not use this dataset. If you can answer, I will be very grateful.
No worries. As for the two questions:
- To run the code with 4 GPUs, you can pass `--num_gpu 4` as an argument to the training script (see the sketch below for the general pattern).
- There are two versions of S3DIS: the original one and the aligned one. You can refer to this issue charlesq34/pointnet#20 to learn more about it.
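(Editor's note: the `num_gpu=4` and `tower_name='tower'` fields in the `Namespace` output below suggest the script uses the classic TF 1.x multi-tower pattern. The sketch below shows roughly how such a setup splits one batch across GPUs; it is an assumption about the script's internals, with a placeholder loss instead of the repo's model.)

```python
import tensorflow as tf  # TF 1.x

NUM_GPU = 4
# Hypothetical input: a batch of 8 point clouds, 4096 points, 9 features each.
pointclouds_pl = tf.placeholder(tf.float32, shape=(8, 4096, 9))

tower_losses = []
with tf.variable_scope(tf.get_variable_scope()):
    # Split the batch so each GPU gets batch_size / NUM_GPU samples.
    for i, chunk in enumerate(tf.split(pointclouds_pl, NUM_GPU, axis=0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            # The repo's model construction would go here; a mean over the
            # chunk is used purely so the sketch runs.
            loss = tf.reduce_mean(chunk)
            tower_losses.append(loss)
            # Share the same weights across all towers.
            tf.get_variable_scope().reuse_variables()

total_loss = tf.add_n(tower_losses) / NUM_GPU
```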
Today I used 4 V100s to train the code, but the following errors were reported:
root@22f049f4b6e0:/home/deep_gcns-master/deep_gcns-master/sem_seg# sh +x train_job.sh
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Namespace(batch_size=8, bn_decay_clip=0.99, bn_decay_decay_rate=0.5, bn_decay_decay_step=300000, bn_init_decay=0.5, checkpoint='', dataset='s3dis', decay_rate=0.5, decay_step=300000, dilations=[-1], edge_lay='dilated', gcn='edgeconv', learning_rate=0.001, log_dir='ResGCN-28/log1', max_epoch=151, model='model', momentum=0.9, normalize_sage=False, num_classes=13, num_filters=[64], num_gpu=4, num_layers=28, num_neighbors=[16], num_points=4096, optimizer='adam', skip_connect='residual', sto_dilated_epsilon=0.2, stochastic_dilation=True, test_area=1, tower_name='tower', zero_epsilon_gin=False)
Using edgeconv gcn
Training on Stanford 3D Indoor Spaces Dataset
Room files length 23585
Train set shape inputs (19898, 4096, 9), labels (19898, 4096)
Test set shape inputs (3687, 4096, 9), labels (3687, 4096)
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:712: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.
WARNING:tensorflow:From train.py:143: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
Using Adam as optimizer
WARNING:tensorflow:From train.py:158: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
WARNING:tensorflow:From train.py:167: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From train.py:167: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/sem_seg/model.py:63: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:657: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_edge.py:73: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_edge.py:67: The name tf.random_shuffle is deprecated. Please use tf.random.shuffle instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_vertex.py:99: calling reduce_max_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:377: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:635: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Done!!!
####################################################################################################
WARNING:tensorflow:From train.py:235: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
WARNING:tensorflow:From train.py:249: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
2021-02-23 04:56:12.175095: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-02-23 04:56:12.184252: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-02-23 04:56:13.037932: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16091d50 executing computations on platform CUDA. Devices:
2021-02-23 04:56:13.037981: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.037997: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.038010: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.038021: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (3): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.060843: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2198970000 Hz
2021-02-23 04:56:13.064148: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x161caff0 executing computations on platform Host. Devices:
2021-02-23 04:56:13.064206: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2021-02-23 04:56:13.068652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:07:00.0
2021-02-23 04:56:13.072833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:08:00.0
2021-02-23 04:56:13.076972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0e:00.0
2021-02-23 04:56:13.079347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0f:00.0
2021-02-23 04:56:13.079502: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079594: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079680: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079763: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079844: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079925: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.083293: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-02-23 04:56:13.083329: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2021-02-23 04:56:13.083375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-23 04:56:13.083393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3
2021-02-23 04:56:13.083405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y
2021-02-23 04:56:13.083415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y
2021-02-23 04:56:13.083425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y
2021-02-23 04:56:13.083436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N
WARNING:tensorflow:From train.py:258: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
WARNING:tensorflow:From train.py:259: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
2021-02-23 04:56:23.254683: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
**** EPOCH 001 ****
----
Current batch/total batch num: 0/621
2021-02-23 04:56:33.930261: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:34.409448: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:36.750515: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: The graph couldn't be sorted in topological order.
2021-02-23 04:56:36.897671: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] arithmetic_optimizer failed: Invalid argument: The graph couldn't be sorted in topological order.
2021-02-23 04:56:37.022374: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:37.225320: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
And the GPU memory usage is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |
| N/A 36C P0 51W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |
| N/A 35C P0 51W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... On | 00000000:0F:00.0 Off | 0 |
| N/A 36C P0 50W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2876 C python 305MiB |
| 1 2876 C python 305MiB |
| 2 2876 C python 305MiB |
| 3 2876 C python 305MiB |
+-----------------------------------------------------------------------------+
It is notable that I have not carried out step 2 from the README, which is as follows:
2. Download the 3D indoor parsing dataset ([S3DIS Dataset](http://buildingparser.stanford.edu/dataset.html)) for testing and visualization. "Stanford3dDataset_v1.2_Aligned_Version.zip" of the dataset is used. Unzip the downloaded file into "deep_gcns/data" and merge with the folder `Stanford3dDataset_v1.2_Aligned_Version`, which already contains the patches `S3DIS_PATCH.diff` and `DS_STORE_PATCH.diff`, then run,
... and so on.
I just ran `sh +x download_data.sh` and then `sh +x train_job.sh` to train the code. Do I have to run step 2?
Thank you!
Hi @jiaruHithub. I only see warnings, no errors. Does the python process crash? It may take a while to get the output.
Thank you. The code does run, but 4 V100s running for 24 hours only trained 4 epochs. I looked up this warning in TensorFlow's issues, and many people report that it seriously slows down training. Do you have a solution, or what do you think the problem is? Thank you!
Hi @jiaruHithub. Sorry, I also have no idea why exactly this happens, but I was able to run 100 epochs within 2-3 days on 2 V100s. I found that this issue, https://github.com/tensorflow/tensorflow/issues/24816#issuecomment-575949674, may help: it suggests replacing `tf.concat` with `tf.stack`.
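(Editor's note: the swap suggested in that TensorFlow comment looks roughly like the sketch below. Whether it actually resolves the topological-sort errors in this repo is not verified here; the tensors and shapes are made up for illustration.)

```python
import tensorflow as tf  # TF 1.x

a = tf.zeros([4096, 9])
b = tf.ones([4096, 9])

# Pattern the linked comment suggests replacing: expand each tensor,
# then concatenate along the new leading axis.
merged_concat = tf.concat([tf.expand_dims(a, 0), tf.expand_dims(b, 0)], axis=0)

# Suggested replacement: tf.stack adds the new axis in a single op.
merged_stack = tf.stack([a, b], axis=0)

# Both results have shape (2, 4096, 9); only the ops used to build them differ.
```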
Thank you. These days I have tried several approaches to fix this problem, but none seem to have any effect. :) I'll keep an eye on it. Thank you!
When I ran the code, I had not downloaded the dataset `Stanford3dDataset_v1.2_Aligned_Version.zip`; I just used S3DIS. The problem occurs as follows: errors keep popping up, the code doesn't stop, and it just keeps reporting errors while looping through EPOCH 001. Please help me, thank you!