prometheusXN / LADAN

Source code for the ACL 2020 paper "Distinguish Confusing Law Articles for Legal Judgment Prediction"
MIT License

Running LADAN+MTL_large.py: OOM after a very long time without completing a single epoch #12

Closed. PolarisRisingWar closed this issue 10 months ago.

PolarisRisingWar commented 1 year ago

When I ran LADAN+MTL_large.py, it still had not finished a single epoch after 20 hours, and it then reported an OOM error. I have already reduced the batch size to a very small value, so I would like to ask what other factors can affect GPU memory usage. I don't use TensorFlow much; is there a quick way to debug this kind of situation?

My error output looks roughly like this:

/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Using TensorFlow backend.
WARNING:tensorflow:From LADAN+MTL_large.py:187: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From LADAN+MTL_large.py:240: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/keras/backend.py:3794: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From LADAN+MTL_large.py:347: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See `tf.nn.softmax_cross_entropy_with_logits_v2`.

WARNING:tensorflow:From LADAN+MTL_large.py:349: arg_max (from tensorflow.python.ops.gen_math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.math.argmax` instead
WARNING:tensorflow:From LADAN+MTL_large.py:376: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.

WARNING:tensorflow:From LADAN+MTL_large.py:412: The name tf.add_to_collection is deprecated. Please use tf.compat.v1.add_to_collection instead.

WARNING:tensorflow:From LADAN+MTL_large.py:416: The name tf.get_collection is deprecated. Please use tf.compat.v1.get_collection instead.

WARNING:tensorflow:From LADAN+MTL_large.py:420: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.

WARNING:tensorflow:From LADAN+MTL_large.py:429: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

2022-12-08 16:40:23.383015: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2022-12-08 16:40:23.426005: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1800000000 Hz
2022-12-08 16:40:23.430214: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f54978e0 executing computations on platform Host. Devices:
2022-12-08 16:40:23.430288: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2022-12-08 16:40:23.436761: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2022-12-08 16:40:26.878260: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f53210d0 executing computations on platform CUDA. Devices:
2022-12-08 16:40:26.878370: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2022-12-08 16:40:26.880622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:3b:00.0
2022-12-08 16:40:26.881237: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-12-08 16:40:26.884967: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-12-08 16:40:26.887944: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10.0
2022-12-08 16:40:26.888472: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10.0
2022-12-08 16:40:26.891360: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10.0
2022-12-08 16:40:26.893620: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10.0
2022-12-08 16:40:26.898651: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2022-12-08 16:40:26.899852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2022-12-08 16:40:26.899903: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
2022-12-08 16:40:26.901230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-12-08 16:40:26.901252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2022-12-08 16:40:26.901259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
2022-12-08 16:40:26.902515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3777 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:3b:00.0, compute capability: 7.5)
2022-12-08 16:40:28.982252: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2022-12-08 16:41:14.202769: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2022-12-09 12:48:53.798908: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 172.85MiB (rounded to 181248000).  Current allocation summary follows.
2022-12-09 12:48:53.799598: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256):   Total Chunks: 87, Chunks in use: 86. 21.8KiB allocated for chunks. 21.5KiB in use in bin. 624B client-requested in use in bin.
2022-12-09 12:48:53.799637: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (512):   Total Chunks: 21, Chunks in use: 20. 12.0KiB allocated for chunks. 11.2KiB in use in bin. 8.5KiB client-requested in use in bin.
2022-12-09 12:48:53.799654: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (1024):  Total Chunks: 81, Chunks in use: 80. 94.5KiB allocated for chunks. 93.2KiB in use in bin. 91.2KiB client-requested in use in bin.
2022-12-09 12:48:53.799669: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (2048):  Total Chunks: 13, Chunks in use: 12. 29.0KiB allocated for chunks. 25.2KiB in use in bin. 21.0KiB client-requested in use in bin.
2022-12-09 12:48:53.799683: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (4096):  Total Chunks: 9, Chunks in use: 8. 43.0KiB allocated for chunks. 38.0KiB in use in bin. 37.8KiB client-requested in use in bin.
2022-12-09 12:48:53.799699: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (8192):  Total Chunks: 95, Chunks in use: 66. 1.32MiB allocated for chunks. 979.5KiB in use in bin. 978.0KiB client-requested in use in bin.
(similar log lines omitted here)
2022-12-09 12:48:53.799908: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (67108864):  Total Chunks: 3, Chunks in use: 0. 259.28MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-09 12:48:53.799922: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (134217728):     Total Chunks: 10, Chunks in use: 9. 1.59GiB allocated for chunks. 1.43GiB in use in bin. 1.43GiB client-requested in use in bin.
2022-12-09 12:48:53.799936: I tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (268435456):     Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-12-09 12:48:53.799953: I tensorflow/core/common_runtime/bfc_allocator.cc:780] Bin for 172.85MiB was 128.00MiB, Chunk State: 
2022-12-09 12:48:53.799981: I tensorflow/core/common_runtime/bfc_allocator.cc:786]   Size: 160.07MiB | Requested Size: 60.0KiB | in_use: 0 | bin_num: 19, prev:   Size: 172.85MiB | Requested Size: 172.85MiB | in_use: 1 | bin_num: -1
2022-12-09 12:48:53.799993: I tensorflow/core/common_runtime/bfc_allocator.cc:793] Next region of size 1780940800
2022-12-09 12:48:53.800006: I tensorflow/core/common_runtime/bfc_allocator.cc:800] Free  at 0x7fa1e4000000 next 5199 of size 24576000
2022-12-09 12:48:53.800018: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fa1e5770000 next 620 of size 906240
2022-12-09 12:48:53.800029: I tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0x7fa1e584d400 next 2630 of size 226560
(similar log lines omitted here)

2022-12-09 12:48:53.849287: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 1 Chunks of size 91136 totalling 89.0KiB
2022-12-09 12:48:53.849295: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 91392 totalling 178.5KiB
(many more similar log lines omitted)
2022-12-09 12:48:53.851402: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 2 Chunks of size 134217728 totalling 256.00MiB
2022-12-09 12:48:53.851410: I tensorflow/core/common_runtime/bfc_allocator.cc:812] 7 Chunks of size 181248000 totalling 1.18GiB
2022-12-09 12:48:53.851417: I tensorflow/core/common_runtime/bfc_allocator.cc:816] Sum Total of in-use chunks: 3.04GiB
2022-12-09 12:48:53.851426: I tensorflow/core/common_runtime/bfc_allocator.cc:818] total_region_allocated_bytes_: 3960930304 memory_limit_: 3960930304 available bytes: 0 curr_region_allocation_bytes_: 4294967296
2022-12-09 12:48:53.851444: I tensorflow/core/common_runtime/bfc_allocator.cc:824] Stats: 
Limit:                  3960930304
InUse:                  3267110912
MaxInUse:               3686056448
NumAllocs:              2608654986
MaxAllocSize:            315457792

2022-12-09 12:48:53.852011: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **__*_*_*********************************___********************************************************
2022-12-09 12:48:53.852070: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[118,256,15,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Program start time:
2022-12-08 16:39:35.666083
Model loaded succeed
Model loaded succeed
Traceback (most recent call last):
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[118,256,15,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/mul_grad/Mul_1-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Adam/update/_990]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[118,256,15,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/mul_grad/Mul_1-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "LADAN+MTL_large.py", line 490, in <module>
    loss_value, _, graph_chose_value= sess.run([loss_total, train_op, graph_chose_loss], feed_dict=feed_dict)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/wanghuijuan/anaconda3/envs/envtf114/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[118,256,15,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/mul_grad/Mul_1-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Adam/update/_990]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[118,256,15,100] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[{{node gradients/mul_grad/Mul_1-1-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

My changes to the original code should be minimal. The law_label2index_large.pkl file, which is missing from the original GitHub project, was generated from the new_law.txt file produced by preprocessing the CAIL-big dataset, using roughly the following script:

import argparse
import pickle as pk

parser = argparse.ArgumentParser()
parser.add_argument('--law_file')     # new_law.txt: one law article per line
parser.add_argument('--output_file')  # destination pickle, e.g. law_label2index_large.pkl
args = parser.parse_args()

# Map each law article to its (zero-based) line index.
with open(args.law_file) as f:
    label2index = {line.strip(): i for i, line in enumerate(f)}

with open(args.output_file, 'wb') as f:
    pk.dump(label2index, f)
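
For reference, assuming the script above is saved as, say, build_law_label2index.py (a hypothetical name), it would be invoked as:

python build_law_label2index.py --law_file new_law.txt --output_file law_label2index_large.pkl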
prometheusXN commented 1 year ago

I don't know how much GPU memory you have. If it still doesn't fit, I suggest using TensorFlow's data loader (tf.data) to feed the data instead of pushing all of the data into GPU memory at once; the model itself is very small. You can also watch for the TensorFlow 2.x version that we will release later on GitHub, together with our journal extension of this work.
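
A minimal sketch of such a tf.data input pipeline under TF 1.x (matching the envtf114 environment in the log above); the fact_ids and law_labels arrays are hypothetical stand-ins for whatever preprocessed arrays LADAN+MTL_large.py actually loads:

import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the preprocessed fact inputs and law labels.
fact_ids = np.random.randint(0, 5000, size=(1000, 512)).astype(np.int64)
law_labels = np.random.randint(0, 118, size=(1000,)).astype(np.int64)

# Stream shuffled mini-batches instead of materializing everything at once.
dataset = (tf.data.Dataset.from_tensor_slices((fact_ids, law_labels))
           .shuffle(buffer_size=256)
           .batch(32)
           .prefetch(1))
iterator = dataset.make_initializable_iterator()
next_facts, next_labels = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            facts, labels = sess.run([next_facts, next_labels])
            # feed `facts`/`labels` into the existing feed_dict-based train step
        except tf.errors.OutOfRangeError:
            break  # one epoch finished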

prometheusXN commented 1 year ago

In theory, as long as you can train on the small dataset, the large dataset can also be trained with the same batch_size setting, so you may want to check whether the data itself is occupying too much GPU memory.
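
Separately, the OOM hint in the log above already points at a quick diagnostic: passing report_tensor_allocations_upon_oom in RunOptions makes TensorFlow dump the live tensor allocations when a step runs out of memory. A self-contained sketch, with a dummy graph standing in for the real model:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 100])
w = tf.Variable(tf.zeros([100, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Report the list of live tensor allocations if this step OOMs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.random.rand(32, 100).astype(np.float32)
    loss_value, _ = sess.run([loss, train_op],
                             feed_dict={x: batch},
                             options=run_options)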

prometheusXN commented 1 year ago

If you still get OOM, you need to check whether the code you modified creates placeholders inside a loop. If extra placeholders keep being added to the TensorFlow computation graph (usually the result of a for loop), the program will also OOM after running for some number of steps, because GPU memory usage keeps growing: when a new placeholder is defined, TensorFlow does not free the memory held by placeholders that no longer participate in the computation. So it's worth auditing your changes for this.
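To make that failure mode concrete, here is a minimal sketch in TF 1.x graph mode (illustrative names only, not code from LADAN): build the graph once outside the loop, and optionally finalize it so any op accidentally created inside the loop raises an error immediately.

import numpy as np
import tensorflow as tf  # TF 1.x, matching this repo's environment

# Build the graph ONCE, before the training loop.
x = tf.placeholder(tf.float32, [None, 100])
logits = tf.layers.dense(x, 10)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Optional debugging aid: after finalize(), any attempt to add a node
    # (e.g. a tf.placeholder created inside the loop) raises a RuntimeError
    # at the offending line instead of silently growing the graph.
    sess.graph.finalize()
    for step in range(1000):
        batch = np.random.rand(8, 100).astype(np.float32)
        sess.run(logits, feed_dict={x: batch})
        # If tf.placeholder or layer constructors were called here instead,
        # each step would append new nodes to the default graph and GPU
        # memory use would creep upward until an OOM like the one above.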


PolarisRisingWar commented 1 year ago

OK... I'm almost completely unfamiliar with TensorFlow at the moment, so I may need to come back later and work out how to debug this. On my side, a single GPU card has 15109 MiB.
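One detail worth checking against that 15109 MiB: the startup log in the original report shows the process only obtained about 3777 MB on the Tesla T4, so something else may have been holding the rest of the card (nvidia-smi will show it). Below is a minimal sketch, assuming the TF 1.x APIs this repo uses (loss_total and feed_dict are the names from the traceback above), of two quick debugging levers:

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing a fixed block up front,
# so nvidia-smi reflects the model's real usage as it grows.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# Follow the hint printed in the OOM message: report the live tensors
# at the moment the allocator fails.
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
# loss_value = sess.run(loss_total, feed_dict=feed_dict, options=run_opts)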

PolarisRisingWar commented 1 year ago

Has the TensorFlow 2.x version been released yet?

prometheusXN commented 1 year ago

There is a TensorFlow 2.x version: when we extended this work for journal submission, we reimplemented LADAN in TensorFlow 2.x. We just haven't had time to clean it up and release it yet; it will be published later.


prometheusXN commented 1 year ago

I'll try to clean it up and open-source it within the next couple of days.


prometheusXN commented 1 year ago

We have now open-sourced a simplified version. You can find our TF 2.x implementation of the LADAN model at this link: https://github.com/prometheusXN/D-LADAN
