Hi @jiaruHithub. It seems the problem happens because of an out-of-memory error. Could you try using a GPU with more memory, or reduce the batch size?
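(Editor's note: for readers hitting the same out-of-memory error, below is a minimal TF 1.x sketch of the two usual workarounds. It assumes the training script builds its own `tf.Session` and accepts a `--batch_size` flag, as the `Namespace` output further down suggests; it is illustrative, not the repo's actual code.)

```python
import tensorflow as tf  # TF 1.x, matching the log in this thread

# Option 1: pass a smaller batch to the training script, e.g.
#   python train.py --batch_size 4
# instead of the default batch_size=8 shown in the Namespace below.

# Option 2: let TensorFlow allocate GPU memory on demand instead of
# reserving it all up front (helps when the model barely fits).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```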
Thank you for the answer from the author himself. By the way, I have a few questions:
- Tomorrow I will use 4 Tesla V100s to train the code. How should I change the code so that it trains on 4 GPUs in parallel?
- In the TensorFlow version, what is the role of the dataset `Stanford3dDataset_v1.2_Aligned_Version.zip`? The PyTorch version does not use this dataset. If you can answer, I will be very grateful.
No worries. As for the two questions:
- To run the code with 4 GPUs, you can pass `--num_gpu 4` as an argument to the training script (see the sketch below for the general pattern).
- There are two versions of S3DIS: the original one and the aligned one. You can refer to this issue charlesq34/pointnet#20 to learn more about it.
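(Editor's note: the `num_gpu=4` and `tower_name='tower'` fields in the `Namespace` output below suggest the script uses the classic TF 1.x multi-tower pattern. The sketch below shows roughly how such a setup splits one batch across GPUs; it is an assumption about the script's internals, with a placeholder loss instead of the repo's model.)

```python
import tensorflow as tf  # TF 1.x

NUM_GPU = 4
# Hypothetical input: a batch of 8 point clouds, 4096 points, 9 features each.
pointclouds_pl = tf.placeholder(tf.float32, shape=(8, 4096, 9))

tower_losses = []
with tf.variable_scope(tf.get_variable_scope()):
    # Split the batch so each GPU gets batch_size / NUM_GPU samples.
    for i, chunk in enumerate(tf.split(pointclouds_pl, NUM_GPU, axis=0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('tower_%d' % i):
            # The repo's model construction would go here; a mean over the
            # chunk is used purely so the sketch runs.
            loss = tf.reduce_mean(chunk)
            tower_losses.append(loss)
            # Share the same weights across all towers.
            tf.get_variable_scope().reuse_variables()

total_loss = tf.add_n(tower_losses) / NUM_GPU
```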
Today I used 4 V100s to train the code, but the following errors were reported:
root@22f049f4b6e0:/home/deep_gcns-master/deep_gcns-master/sem_seg# sh +x train_job.sh
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.5/dist-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Namespace(batch_size=8, bn_decay_clip=0.99, bn_decay_decay_rate=0.5, bn_decay_decay_step=300000, bn_init_decay=0.5, checkpoint='', dataset='s3dis', decay_rate=0.5, decay_step=300000, dilations=[-1], edge_lay='dilated', gcn='edgeconv', learning_rate=0.001, log_dir='ResGCN-28/log1', max_epoch=151, model='model', momentum=0.9, normalize_sage=False, num_classes=13, num_filters=[64], num_gpu=4, num_layers=28, num_neighbors=[16], num_points=4096, optimizer='adam', skip_connect='residual', sto_dilated_epsilon=0.2, stochastic_dilation=True, test_area=1, tower_name='tower', zero_epsilon_gin=False)
Using edgeconv gcn
Training on Stanford 3D Indoor Spaces Dataset
Room files length 23585
Train set shape inputs (19898, 4096, 9), labels (19898, 4096)
Test set shape inputs (3687, 4096, 9), labels (3687, 4096)
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:712: The name tf.train.exponential_decay is deprecated. Please use tf.compat.v1.train.exponential_decay instead.
WARNING:tensorflow:From train.py:143: The name tf.summary.scalar is deprecated. Please use tf.compat.v1.summary.scalar instead.
Using Adam as optimizer
WARNING:tensorflow:From train.py:158: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.
WARNING:tensorflow:From train.py:167: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From train.py:167: The name tf.get_variable_scope is deprecated. Please use tf.compat.v1.get_variable_scope instead.
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/sem_seg/model.py:63: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:657: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_edge.py:73: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_edge.py:67: The name tf.random_shuffle is deprecated. Please use tf.random.shuffle instead.
WARNING:tensorflow:
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/gcn_lib/tf_vertex.py:99: calling reduce_max_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:377: The name tf.nn.max_pool is deprecated. Please use tf.nn.max_pool2d instead.
WARNING:tensorflow:From /home/deep_gcns-master/deep_gcns-master/utils/tf_util.py:635: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Done!!!
####################################################################################################
WARNING:tensorflow:From train.py:235: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
####################################################################################################
Building model residual edge_conv_layer dilated_knn_graph with 28 layers
Done!!!
####################################################################################################
WARNING:tensorflow:From train.py:249: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.
2021-02-23 04:56:12.175095: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-02-23 04:56:12.184252: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2021-02-23 04:56:13.037932: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16091d50 executing computations on platform CUDA. Devices:
2021-02-23 04:56:13.037981: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.037997: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.038010: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (2): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.038021: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (3): Tesla V100-DGXS-32GB, Compute Capability 7.0
2021-02-23 04:56:13.060843: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2198970000 Hz
2021-02-23 04:56:13.064148: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x161caff0 executing computations on platform Host. Devices:
2021-02-23 04:56:13.064206: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2021-02-23 04:56:13.068652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:07:00.0
2021-02-23 04:56:13.072833: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:08:00.0
2021-02-23 04:56:13.076972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 2 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0e:00.0
2021-02-23 04:56:13.079347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 3 with properties:
name: Tesla V100-DGXS-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:0f:00.0
2021-02-23 04:56:13.079502: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079594: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079680: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079763: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079844: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.079925: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-02-23 04:56:13.083293: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2021-02-23 04:56:13.083329: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2021-02-23 04:56:13.083375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-02-23 04:56:13.083393: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0 1 2 3
2021-02-23 04:56:13.083405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N Y Y Y
2021-02-23 04:56:13.083415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1: Y N Y Y
2021-02-23 04:56:13.083425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 2: Y Y N Y
2021-02-23 04:56:13.083436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 3: Y Y Y N
WARNING:tensorflow:From train.py:258: The name tf.summary.merge_all is deprecated. Please use tf.compat.v1.summary.merge_all instead.
WARNING:tensorflow:From train.py:259: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.
2021-02-23 04:56:23.254683: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
**** EPOCH 001 ****
----
Current batch/total batch num: 0/621
2021-02-23 04:56:33.930261: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:34.409448: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:36.750515: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] remapper failed: Invalid argument: The graph couldn't be sorted in topological order.
2021-02-23 04:56:36.897671: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] arithmetic_optimizer failed: Invalid argument: The graph couldn't be sorted in topological order.
2021-02-23 04:56:37.022374: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 0, topological sort failed with message: The graph couldn't be sorted in topological order.
2021-02-23 04:56:37.225320: E tensorflow/core/grappler/optimizers/dependency_optimizer.cc:697] Iteration = 1, topological sort failed with message: The graph couldn't be sorted in topological order.
And the GPU memory usage is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... On | 00000000:07:00.0 Off | 0 |
| N/A 36C P0 51W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |
| N/A 34C P0 53W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |
| N/A 35C P0 51W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... On | 00000000:0F:00.0 Off | 0 |
| N/A 36C P0 50W / 300W | 317MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2876 C python 305MiB |
| 1 2876 C python 305MiB |
| 2 2876 C python 305MiB |
| 3 2876 C python 305MiB |
+-----------------------------------------------------------------------------+
It is notable that I have not carried out step 2 from the README, which is as follows:
2. Download the 3D indoor parsing dataset ([S3DIS Dataset](http://buildingparser.stanford.edu/dataset.html)) for testing and visualization. "Stanford3dDataset_v1.2_Aligned_Version.zip" of the dataset is used. Unzip the downloaded file into "deep_gcns/data" and merge with the folder `Stanford3dDataset_v1.2_Aligned_Version`, which already contains the patches `S3DIS_PATCH.diff` and `DS_STORE_PATCH.diff`, then run,
... and so on.
I just ran `sh +x download_data.sh` and then `sh +x train_job.sh` to train the code. Do I have to run step 2?
Thank you!
Hi @jiaruHithub. I only see warnings, no errors. Does the python process crash? It may take a while to get the output.
Thank you. The code does run, but 4 V100s running for 24 hours only trained 4 epochs. I looked up this warning in TensorFlow's issues, and many people report that it seriously slows down training. Do you have a solution, or what do you think the problem is? Thank you!
Hi @jiaruHithub. Sorry, I also have no idea why exactly this happens, but I was able to run 100 epochs within 2-3 days on 2 V100s. I found that this issue, https://github.com/tensorflow/tensorflow/issues/24816#issuecomment-575949674, may help: it suggests replacing `tf.concat` with `tf.stack`.
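(Editor's note: the swap suggested in that TensorFlow comment looks roughly like the sketch below. Whether it actually resolves the topological-sort errors in this repo is not verified here; the tensors and shapes are made up for illustration.)

```python
import tensorflow as tf  # TF 1.x

a = tf.zeros([4096, 9])
b = tf.ones([4096, 9])

# Pattern the linked comment suggests replacing: expand each tensor,
# then concatenate along the new leading axis.
merged_concat = tf.concat([tf.expand_dims(a, 0), tf.expand_dims(b, 0)], axis=0)

# Suggested replacement: tf.stack adds the new axis in a single op.
merged_stack = tf.stack([a, b], axis=0)

# Both results have shape (2, 4096, 9); only the ops used to build them differ.
```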
Thank you. These days I have tried several approaches to fix this problem, but none seem to have any effect. :) I'll keep an eye on it. Thank you!
When I ran the code, I had not downloaded the dataset `Stanford3dDataset_v1.2_Aligned_Version.zip`; I just used S3DIS. The problem occurs as follows: errors keep popping up, the code doesn't stop, and it just keeps reporting errors while looping through EPOCH 001. Please help me, thank you!