sogou / SogouMRCToolkit

This toolkit was designed for the fast and efficient development of modern machine comprehension models, including both published models and original prototypes.
Apache License 2.0

ran out of memory #1

Closed SunYanCN closed 5 years ago

SunYanCN commented 5 years ago

TF: tensorflow-gpu==1.12, GPU: Tesla P4 (8 GB). When I try to run run_bidafplus_squad.py, I get GPU memory allocation warnings. I'm not sure whether this affects the training results.

2019-04-07 05:11:40.657538: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-04-07 05:11:41.446788: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-04-07 05:11:41.447151: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla P4 major: 6 minor: 1 memoryClockRate(GHz): 1.1135
pciBusID: 0000:00:06.0
totalMemory: 7.43GiB freeMemory: 7.31GiB
2019-04-07 05:11:41.447178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:11:41.882084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:11:41.882132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:11:41.882141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:11:41.882363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:11:42,321 - root - INFO - Reading file at train-v1.1.json
2019-04-07 05:11:42,322 - root - INFO - Processing the dataset.
87599it [07:43, 189.13it/s]
2019-04-07 05:19:25,497 - root - INFO - Reading file at dev-v1.1.json
2019-04-07 05:19:25,497 - root - INFO - Processing the dataset.
10570it [00:53, 196.53it/s]
2019-04-07 05:20:19,349 - root - INFO - Building vocabulary.
100%|███████████████████████████████████| 98169/98169 [00:30<00:00, 3218.07it/s]
2019-04-07 05:21:05.747563: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:05.747695: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:05.747711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:05.747718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:05.747925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:06.489069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:06.489145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:06.489156: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:06.489162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:06.489389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:07.117979: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-07 05:21:07.118055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-07 05:21:07.118066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-04-07 05:21:07.118072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-04-07 05:21:07.118278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7051 MB memory) -> physical GPU (device: 0, name: Tesla P4, pci bus id: 0000:00:06.0, compute capability: 6.1)
2019-04-07 05:21:13,046 - root - INFO - Epoch 1/15
2019-04-07 05:21:13,351 - root - INFO - Eposide 1/2
2019-04-07 05:21:23.422390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 10494 of 87599
2019-04-07 05:21:33.422566: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 21931 of 87599
2019-04-07 05:21:43.422157: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 32210 of 87599
2019-04-07 05:21:53.422415: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 42018 of 87599
2019-04-07 05:22:03.422089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 52336 of 87599
2019-04-07 05:22:13.422587: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 62125 of 87599
2019-04-07 05:22:23.422099: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 72157 of 87599
2019-04-07 05:22:33.421957: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:98] Filling up shuffle buffer (this may take a while): 82242 of 87599
2019-04-07 05:22:38.605655: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:136] Shuffle buffer filled.
2019-04-07 05:22:57.952087: E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 2.88G (3091968768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-04-07 05:23:27.134938: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.96GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:24:09.911666: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.28GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:01.375542: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.23GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:01.673176: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.173192: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.92GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.490319: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:33.502105: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.52GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:44,872 - root - INFO - - Train metrics: loss: 5.875
2019-04-07 05:28:46.141381: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:46.477394: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.64GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:28:47.501813: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.09GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-04-07 05:29:05,078 - root - INFO - - Eval metrics: loss: 3.759
2019-04-07 05:29:21,705 - root - INFO - - Eval metrics: exact_match: 51.325 ; f1: 63.040
2019-04-07 05:29:21,705 - root - INFO - - epoch 1 eposide 1: Found new best score: 63.039909
2019-04-07 05:29:21,705 - root - INFO - Eposide 2/2
2019-04-07 05:34:47,135 - root - INFO - - Train metrics: loss: 4.882
2019-04-07 05:35:02,895 - root - INFO - - Eval metrics: loss: 3.376
2019-04-07 05:35:19,210 - root - INFO - - Eval metrics: exact_match: 57.313 ; f1: 68.490
2019-04-07 05:35:19,210 - root - INFO - - epoch 1 eposide 2: Found new best score: 68.490210
2019-04-07 05:35:19,210 - root - INFO - Epoch 2/15
2019-04-07 05:35:19,213 - root - INFO - Eposide 1/2
yylun commented 5 years ago

@SunYanCN Our examples are tested on P40 and V100, so we have not encountered such a problem yet. Maybe you could try a smaller batch_size or shuffle_ratio in BatchGenerator.
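A minimal sketch of what that change might look like, assuming the BatchGenerator constructed in run_bidafplus_squad.py accepts the batch_size and shuffle_ratio keyword arguments named above (exact defaults and values are illustrative, not from the script):

import tensorflow as tf
from sogou_mrc.data.batch_generator import BatchGenerator

# vocab and train_data are built earlier in the script (Vocabulary / SquadReader).
# A smaller batch_size lowers peak activation memory on an 8 GB card like the P4;
# a smaller shuffle_ratio keeps fewer instances buffered at once.
train_batch_generator = BatchGenerator(
    vocab, train_data,
    batch_size=16,       # lowered from the script's default; tune to fit memory
    training=True,
    shuffle_ratio=0.2,   # hypothetical value, see the maintainer's suggestion above
)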

zww847204326 commented 4 years ago

Can you tell me your versions of CUDA and cuDNN? I run into trouble when I try to run it:

Possibly insufficient driver version: 415.27.0
UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
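For reference, a frequent cause of "Failed to get convolution algorithm / cuDNN failed to initialize" on TF 1.x is the GPU memory being fully pre-allocated at session creation. Enabling memory growth often helps; below is a minimal TF 1.12 sketch using standard TensorFlow API (not part of the toolkit, and where the session is actually created inside run_bidafplus_squad.py may differ):

import tensorflow as tf

# Standard TF 1.x setting: allocate GPU memory on demand instead of grabbing
# it all up front, which can avoid cuDNN initialization failures.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

sess = tf.Session(config=config)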