About a runtime error #3

Open Fumon554 opened 6 months ago

Fumon554 commented 6 months ago

Hello, after running python MCAN_reproduction.py I got the error output below. It looks like the cause is insufficient memory. How much memory does your environment have?

2024-03-29 16:06:31,940 WARNING -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Result for TrainALL_153bb_00000: date: 2024-03-29_16-05-25 experiment_id: b7aee14d22604d25b3e1eabeb0782826 hostname: LAPTOP-2S5HFEN5 node_ip: pid: 8621 timestamp: 1711699525 trial_id: 153bb_00000

2024-03-29 16:06:32,106 ERROR -- An exception occurred when trying to stop the Ray actor:
Traceback (most recent call last):
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/tune/execution/", line 94, in _post_stop_cleanup
    ray.get(future, timeout=0)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/_private/", line 2282, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: TrainALL
  actor_id: b5441fb9217c28c6ab146ca501000000
  pid: 8621
  namespace: 4e626107-faec-41b6-b346-8dc98df164a9
  ip:
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Traceback (most recent call last):
  File "", line 1265, in
    analysis =
  File "/home/fumon/anaconda3/envs/MCAN/lib/python3.8/site-packages/ray/tune/", line 752, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [TrainALL_153bb_00000])

plw-study commented 2 months ago

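@Fumon554 Hi, my environment has 64 GB of RAM and 24 GB of GPU memory. If you do not have enough memory, you can adjust the Ray parameters or reduce the batch size. In any case, it is better to get the code running first and tune the accuracy afterwards.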
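The two suggestions above (adjusting the Ray parameters and lowering the batch size) could look roughly like the sketch below. It is not taken from MCAN_reproduction.py: the helper names `low_memory_tune_kwargs` and `run_with_low_memory` and the config key `"batch_size"` are hypothetical, and the script's real config keys may differ. The only Ray calls used are `ray.init` and `tune.run` with their `num_cpus`, `resources_per_trial`, and `max_concurrent_trials` arguments.

```python
def low_memory_tune_kwargs(config, batch_key="batch_size", shrink=2, cpus=1, gpus=0):
    """Return (new_config, tune_kwargs) chosen to reduce memory pressure.

    `batch_key` is an assumed config key; check the actual script for the
    name it uses. The batch size is divided by `shrink`, never below 1.
    """
    new_config = dict(config)  # copy so the caller's config is untouched
    new_config[batch_key] = max(1, int(config[batch_key]) // shrink)
    tune_kwargs = {
        # Resources reserved for each trial.
        "resources_per_trial": {"cpu": cpus, "gpu": gpus},
        # Run one trial at a time so trials do not compete for RAM.
        "max_concurrent_trials": 1,
    }
    return new_config, tune_kwargs


def run_with_low_memory(trainable, config, cpus=1, gpus=0):
    """Hypothetical wrapper: start Ray with few CPUs and run a single trial at a time."""
    import ray  # imported lazily so the helper above works without Ray installed
    from ray import tune

    new_config, tune_kwargs = low_memory_tune_kwargs(config, cpus=cpus, gpus=gpus)
    # Fewer CPUs means fewer worker processes, each of which holds its own memory.
    ray.init(num_cpus=cpus, ignore_reinit_error=True)
    return tune.run(trainable, config=new_config, **tune_kwargs)
```

With a trainable such as the `TrainALL` class named in the log, the call would be `run_with_low_memory(TrainALL, config, gpus=1)`. If the worker is still killed, halving the batch size again (`shrink=4`) is the next thing to try.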

plw-study commented 2 months ago

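@184446223 Hi, the result I reproduced on the Twitter dataset is 0.796, which does not exactly match the 0.809 reported in the original paper. This may be due to differences in how the data was processed.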