real-stanford / flingbot

[CoRL 2021 Best System Paper] This repository contains code for training and evaluating FlingBot in both simulation and real-world settings on a dual-UR5 robot arm setup, targeting Ubuntu 18.04.
https://flingbot.cs.columbia.edu/

OutOfMemoryError: Task was killed due to the node running low on memory. #11

Open zcswdt opened 8 months ago

zcswdt commented 8 months ago

```
Traceback (most recent call last):
  File "run_sim.py", line 152, in <module>
    remaining_observations=remaining_observations)
  File "/home/zcs/work/train-my-fling/flingbot/utils.py", line 416, in step_env
    for obs, env_id in ray.get(step_retval):
  File "/home/zcs/miniconda3/envs/flingbot/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/zcs/miniconda3/envs/flingbot/lib/python3.6/site-packages/ray/_private/worker.py", line 2523, in get
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 192.168.0.107, ID: fc2befb2867ce88e73a8a45572c43a640751ae1f2b5e15bd8315f293) where the task
(actor ID: f9cc340f5aef7b479d86345001000000, name=SimEnv.__init__, pid=4331, memory used=2.22GB) was running was
59.49GB / 62.58GB (0.950744), which exceeds the memory usage threshold of 0.95. Ray killed this worker
(ID: d98ac96cdd66ea8c0a2604609381c3256c8285b87822896c767f7714) because it was the most recently scheduled task;
to see more information about memory usage on this node, use `ray logs raylet.out -ip 192.168.0.107`.
To see the logs of the worker, use `ray logs worker-d98ac96cdd66ea8c0a2604609381c3256c8285b87822896c767f7714*out -ip 192.168.0.107`.

Top 10 memory users:
PID   MEM(GB)  COMMAND
7904  2.92     /home/zcs/work/software/pycharm-2023.2.5/jbr/bin/java -classpath /home/zcs/work/software/pycharm-202...
4312  2.22     ray::SimEnv
4331  2.22     ray::SimEnv
4253  2.17     ray::SimEnv
4288  2.15     ray::SimEnv
4252  2.15     ray::SimEnv
4268  2.14     ray::SimEnv.step
4302  2.13     ray::SimEnv.step
4279  2.13     ray::SimEnv.step
4296  2.12     ray::SimEnv

Refer to the documentation on how to address the out of memory issue:
https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html.
Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task.
Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM.
To adjust the kill threshold, set the environment variable RAY_memory_usage_threshold when starting Ray.
To disable worker killing, set the environment variable RAY_memory_monitor_refresh_ms to zero.
```
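For reference, the remedies listed at the bottom of Ray's message map onto a few concrete settings. Below is a minimal sketch, not flingbot's own code: it assumes Ray is started locally from the training script, and `SimEnvStub` plus the specific values (threshold 0.98, 8 CPUs, etc.) are illustrative stand-ins only.

```python
# Minimal sketch of the knobs named in the Ray OOM message (hypothetical
# SimEnvStub; flingbot's real SimEnv actors are created by run_sim.py).
import os
import ray

# The memory monitor reads these variables when the raylet starts, so set
# them before ray.init(). 0.95 is the default kill threshold.
os.environ["RAY_memory_usage_threshold"] = "0.98"
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # 0 disables worker killing

# Capping the CPUs Ray schedules over limits how many ~2.2 GB simulation
# workers can run at once on a ~64 GB machine.
ray.init(num_cpus=8)


# Requesting more CPUs per actor reduces parallelism further, and
# max_restarts / max_task_retries let Ray retry work killed by the OOM monitor.
@ray.remote(num_cpus=2, max_restarts=1, max_task_retries=1)
class SimEnvStub:
    def step(self, action):
        return action
```

In flingbot itself, the simplest equivalent change is launching fewer parallel SimEnv environments in run_sim.py (the memory-user list above shows roughly ten of them at ~2.2 GB each on a 62.58 GB node), or running on a machine with more RAM.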