plw-study / Reproduction_of_MCAN

This is a reproduction of MCAN from the ACL 2021 paper "Multimodal Fusion with Co-Attention Networks for Fake News Detection".

Runtime error report #3

Open Fumon554 opened 6 months ago

Fumon554 commented 6 months ago

Hello, after running `python MCAN_reproduction.py` I got the error output below. It looks like an out-of-memory problem. How much memory does your runtime environment have?

2024-03-29 16:05:17,682 INFO worker.py:1518 -- Started a local Ray instance.
(TrainALL pid=8621) Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
(TrainALL pid=8621) Calling BertTokenizer.from_pretrained() with the path to a single file or url is deprecated
(TrainALL pid=8621) Please check your arguments if you have upgraded adabelief-pytorch from version 0.0.5.
(TrainALL pid=8621) Modifications to default arguments:
(TrainALL pid=8621)                          eps    weight_decouple    rectify
(TrainALL pid=8621) -----------------------  -----  -----------------  ---------
(TrainALL pid=8621) adabelief-pytorch=0.0.5  1e-08  False              False
(TrainALL pid=8621) Current version (0.1.0)  1e-16  True               True
(TrainALL pid=8621) For a complete table of recommended hyperparameters, see
(TrainALL pid=8621) https://github.com/juntang-zhuang/Adabelief-Optimizer
(TrainALL pid=8621)
(TrainALL pid=8621) Weight decoupling enabled in AdaBelief
(TrainALL pid=8621) epoch:0--------------------
2024-03-29 16:06:31,940 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker.
    RayTask ID: ffffffffffffffffb5441fb9217c28c6ab146ca501000000
    Worker ID: f45e4768d78fe2c220f1a6b5f63a8c062dd19faeea70580b7b6d1615
    Node ID: 221fc6fe418c59448bc3bda4e939f5a47b80f08a6b5f2317a6dad29f
    Worker IP address: 172.29.147.200
    Worker port: 39641
    Worker PID: 8621
    Worker exit type: SYSTEM_ERROR
    Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2024-03-29 16:06:31,993 ERROR trial_runner.py:987 -- Trial TrainALL_153bb_00000: Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 996, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/worker.py", line 2282, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: TrainALL
  actor_id: b5441fb9217c28c6ab146ca501000000
  pid: 8621
  namespace: 4e626107-faec-41b6-b346-8dc98df164a9
  ip: 172.29.147.200
The actor is dead because its worker process has died.
  Worker exit type: SYSTEM_ERROR
  Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Result for TrainALL_153bb_00000:
  date: 2024-03-29_16-05-25
  experiment_id: b7aee14d22604d25b3e1eabeb0782826
  hostname: LAPTOP-2S5HFEN5
  node_ip: 172.29.147.200
  pid: 8621
  timestamp: 1711699525
  trial_id: 153bb_00000

2024-03-29 16:06:32,106 ERROR ray_trial_executor.py:103 -- An exception occurred when trying to stop the Ray actor:
Traceback (most recent call last):
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/worker.py", line 2282, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: TrainALL
  actor_id: b5441fb9217c28c6ab146ca501000000
  pid: 8621
  namespace: 4e626107-faec-41b6-b346-8dc98df164a9
  ip: 172.29.147.200
The actor is dead because its worker process has died.
  Worker exit type: SYSTEM_ERROR
  Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Traceback (most recent call last):
  File "MCAN_reproduction.py", line 1265, in <module>
    analysis = tune.run(
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/tune/tune.py", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [TrainALL_153bb_00000])

184446223 commented 2 months ago


Have you been able to reproduce the results on the Twitter dataset?

plw-study commented 2 months ago

@Fumon554 Hi, my environment has 64 GB of RAM and a 24 GB GPU. If you don't have enough memory, you can adjust Ray's parameters or lower the batch size. In any case, it is best to get the code running end to end first and only then tune for accuracy.
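For reference, here is a minimal sketch of what "adjust Ray's parameters / lower the batch size" could look like, assuming the Ray 2.x Tune API shown in the traceback; `train_fn`, the 4 GB object-store cap, and the batch sizes below are illustrative placeholders, not the actual code in MCAN_reproduction.py:

```python
import ray
from ray import tune


def train_fn(config):
    # Stand-in for the real TrainALL trainable in MCAN_reproduction.py:
    # the key idea is to read the batch size from the Tune config so that
    # smaller values can be tried when memory is tight.
    batch_size = config["batch_size"]
    # ... build DataLoaders with batch_size and run the training loop here ...
    tune.report(accuracy=0.0)  # dummy metric so Tune records the trial


# Cap Ray's object store so a single run cannot grab all system RAM
# (4 GB is purely illustrative; size it to your machine).
ray.init(object_store_memory=4 * 1024 ** 3)

analysis = tune.run(
    train_fn,
    config={"batch_size": tune.choice([8, 16])},  # smaller batches lower peak memory
    resources_per_trial={"cpu": 2, "gpu": 1},     # set "gpu": 0 if running on CPU only
    num_samples=1,
)
```

Lowering the object-store cap and the batch size reduces peak memory per trial, which is usually enough to keep the OOM killer from terminating the Ray worker.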

plw-study commented 2 months ago

@184446223 Hi, the result I reproduced on the Twitter dataset is 0.796, which does not exactly match the 0.809 reported in the original paper; the gap is likely due to differences in data preprocessing.