Closed iu777 closed 4 months ago
hey, can you paste your crash message on your learner?
FYI @youliangtan @charlesxu0124 it looks like the agentlace timeout issue is still here on the actor side, we should plan to fix that permanently.
Of course. I have pasted all the output from the `bash run_learner.sh` terminal below for your reference. Thank you:
(serl) aml@aml-NUC11PHi7:~/serl-main/examples/async_peg_insert_drq$ bash run_learner.sh
2024-07-01 16:12:52.652413: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0701 16:12:54.338428 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
I0701 16:12:54.338671 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
I0701 16:12:54.338775 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
The ResNet-10 weights already exist at '/home/aml/.serl/resnet10_params.pkl'.
Loaded 5.418792M parameters from ResNet-10 pretrained on ImageNet-1K
replaced conv_init in pretrained_encoder
replaced norm_init in pretrained_encoder
replaced ResNetBlock_0 in pretrained_encoder
replaced ResNetBlock_1 in pretrained_encoder
replaced ResNetBlock_2 in pretrained_encoder
replaced ResNetBlock_3 in pretrained_encoder
entity: null
exp_descriptor: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097
experiment_id: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
group: null
project: serl_dev
tag: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097
unique_identifier: '20240701_161301'
wandb: Currently logged in as: 345915750 (zpb345915750). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /tmp/tmpy80et43_/wandb/run-20240701_161304-serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: ⭐️ View project at https://wandb.ai/zpb345915750/serl_dev
wandb: 🚀 View run at https://wandb.ai/zpb345915750/serl_dev/runs/serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
demo buffer size: 576
starting learner loop
Filling up replay buffer: 204it [01:09, 2.95it/s]
sent initial network to actor
learner: 0%| | 0/1000000 [00:00<?, ?it/s]/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
(the same flax `vmap` RuntimeWarning is repeated several more times)
I0701 16:16:54.329327 140355677083456 checkpoints.py:574] Saving checkpoint at step: 0
I0701 16:16:54.329805 140355677083456 checkpoints.py:662] Using Orbax as backend to save Flax checkpoints. For potential troubleshooting see: https://flax.readthedocs.io/en/latest/guides/training_techniques/use_checkpointing.html#orbax-as-backend-troubleshooting
W0701 16:16:54.331048 140355677083456 type_handlers.py:222] SaveArgs.aggregate is deprecated, please use custom TypeHandler (https://orbax.readthedocs.io/en/latest/custom_handlers.html#typehandler) or contact Orbax team to migrate before August 1st, 2024.
I0701 16:16:54.335811 140355677083456 checkpointer.py:157] Saving checkpoint to /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20/checkpoint_0.
learner: 0%| | 0/1000000 [02:29<?, ?it/s]
Traceback (most recent call last):
File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 415, in <module>
app.run(main)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 394, in main
learner(
File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 313, in learner
checkpoints.save_checkpoint(
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/training/checkpoints.py", line 697, in save_checkpoint
orbax_checkpointer.save(
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/orbax/checkpoint/checkpointer.py", line 164, in save
raise ValueError(f'Destination {directory} already exists.')
ValueError: Destination /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20/checkpoint_0 already exists.
wandb: - 0.033 MB of 0.033 MB uploaded
wandb: Run history:
wandb: actor/actor_loss ▁
wandb: actor/entropy ▁
wandb: actor/temperature ▁
wandb: actor_lr ▁
wandb: critic/critic_loss ▁
wandb: critic/predicted_qs ▁
wandb: critic/target_qs ▁
wandb: critic_lr ▁
wandb: temperature/temperature_loss ▁
wandb: temperature_lr ▁
wandb: timer/sample_actions ▁
wandb: timer/sample_replay_buffer ▁
wandb: timer/step_env ▁
wandb: timer/total ▁
wandb: timer/train ▁
wandb: timer/train_critics ▁
wandb:
wandb: Run summary:
wandb: actor/actor_loss 0.26329
wandb: actor/entropy -6.81702
wandb: actor/temperature 0.01
wandb: actor_lr 0.0003
wandb: critic/critic_loss 1.82864
wandb: critic/predicted_qs -0.2719
wandb: critic/target_qs -1.03865
wandb: critic_lr 0.0003
wandb: temperature/temperature_loss -0.03989
wandb: temperature_lr 0.0003
wandb: timer/sample_actions 0.05981
wandb: timer/sample_replay_buffer 0.05218
wandb: timer/step_env 0.1329
wandb: timer/total 0.20114
wandb: timer/train 59.3275
wandb: timer/train_critics 30.11736
wandb:
wandb: 🚀 View run serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301 at: https://wandb.ai/zpb345915750/serl_dev/runs/serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: ⭐️ View project at: https://wandb.ai/zpb345915750/serl_dev
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/tmpy80et43_/wandb/run-20240701_161304-serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Exception in thread Thread-4 (run):
Traceback (most recent call last):
File "/home/aml/anaconda3/envs/serl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/aml/anaconda3/envs/serl/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/aml/agentlace-main/agentlace/zmq_wrapper/req_rep.py", line 49, in run
res = self.impl_callback(message)
File "/home/aml/agentlace-main/agentlace/trainer.py", line 117, in __callback_impl
return request_callback(_type, _payload) if request_callback else {}
File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 237, in stats_callback
wandb_logger.log(payload, step=update_steps)
File "/home/aml/serl-main/serl_launcher/serl_launcher/common/wandb.py", line 94, in log
wandb.log(data, step=step)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 449, in wrapper
return func(self, *args, **kwargs)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 400, in wrapper_fn
return func(self, *args, **kwargs)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
return func(self, *args, **kwargs)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1877, in log
self._log(data=data, step=step, commit=commit)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1641, in _log
self._partial_history_callback(data, step, commit)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1513, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 618, in publish_partial_history
self._publish_partial_history(partial_history)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Please help me figure out what is causing this problem and how to solve it. If additional information is needed, please let me know. Thank you. @jianlanluo
Hi @iu777, thanks for trying SERL. Your steps for reproducing the peg insertion task look correct to me. I think the error is due to the save checkpoint path /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20
already existing; the checkpoint-saving code in the learner is non-overwriting by default. This causes the learner to crash first, and then the actor hangs because the learner node is no longer available to receive data.
Just delete that ckpt and retry
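If deleting by hand gets tedious between runs, a small helper can clear stale `checkpoint_*` directories before the learner starts. This is just a sketch; the path is taken from the log above and should be pointed at your own `checkpoint_path`:

```python
import os
import shutil

def clear_old_checkpoints(root: str) -> list[str]:
    """Remove stale checkpoint_* directories so Orbax can save without
    raising 'Destination ... already exists'."""
    removed = []
    if not os.path.isdir(root):
        return removed
    for name in sorted(os.listdir(root)):
        if name.startswith("checkpoint_"):
            shutil.rmtree(os.path.join(root, name))
            removed.append(name)
    return removed

if __name__ == "__main__":
    # Path taken from the traceback above; adjust to your own setup.
    print(clear_old_checkpoints(
        "/home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20"))
```

Alternatively, `flax.training.checkpoints.save_checkpoint` accepts an `overwrite=True` argument; passing it in the learner script would let training overwrite existing checkpoints instead of crashing, if that behavior is acceptable for your runs.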
Thanks, @jianlanluo @charlesxu0124
After deleting the checkpoint folder and the trajectory files from the previous training, I re-taught the target points, recorded the 20 demo trajectories again, and then ran the actor and learner nodes. The machine is now training normally.
But I have a question: the progress bars in both the actor and learner terminals advance slowly, and after half an hour of training my peg task has not succeeded in inserting even once. My Franka also strikes with considerable impact force. Which step did I set up incorrectly, and how can I fix it? I would appreciate a detailed answer, as I am still new to machine learning. Thank you very much!
(In the image below, the terminal on the left is the learner node; the outputs "5555" and "6666" were added by me to observe the program running, so please ignore them. The terminal in the upper right corner is the actor node.)
Looking at your learner speed, it looks like you are running the model on CPU. Also make sure all the functions are JIT-compiled as in our repo. Try reinstalling the whole conda environment with `jax[cuda]` using the updated instructions on the main branch. It is least finicky to do a clean reinstall of the whole environment from scratch rather than trying to reinstall jax by itself. FYI, we get around 4 it/s on an RTX 4090.
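To quickly confirm whether JAX actually sees the GPU after reinstalling, a one-liner check like this can help (the `try/except` is only so the snippet also runs where jax is absent):

```python
# Minimal check of which backend JAX will run on ("gpu" vs "cpu").
try:
    import jax
    backend = jax.default_backend()   # "gpu" means jax[cuda] is working
    devices = [d.platform for d in jax.devices()]
except ImportError:                   # jax not installed at all
    backend, devices = "missing", []

print(backend, devices)
```

If this prints `cpu`, the slow learner speed is explained and the environment still needs the CUDA-enabled jax install.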
Also make sure you read and understand the paper, so that you can set the reference limiting correctly; this will reduce the impact force.
Thank you. I will recheck the environment per your suggestion and verify whether the program is training on the CPU instead of the GPU.
In the current training I have run into a problem. During training, my Franka arm twisted from its originally good vertical posture into a singular configuration, leaving it stuck in mid-air and unable to continue the peg insertion training. The arm stayed stuck for a long time, but the program was not interrupted. Is this phenomenon normal? How can I avoid or handle it? Thank you. An image of the contorted Franka is below. @jianlanluo @charlesxu0124
Yes, joint singularity is a common issue with the Franka arm after repeated motion. Here are a few tips to alleviate the problem:
- Tune the `joint1_nullspace_stiffness` parameter in `serl_franka_controllers`.
Hope this helps
Hello everyone, I am deploying SERL in the real world, and I fully followed the hardware and software deployment instructions on the webpage. I first want to reproduce task 1: peg insertion. Here is my operating procedure; please check whether there are any errors in it. Thank you:
I followed the instructions on the webpage and opened a terminal, running:
"python serl_robot_infra/robot_servers/franka_server.py --gripper_type=Robotiq --robot_ip=172.16.0.2 --gripper_ip=/dev/ttyUSB0"
Following the instructions, I updated the TARGET_POSE in peg_env/config.py with the measured end-effector pose, then recorded 20 demo trajectories with the SpaceMouse, and then edited demo_path and checkpoint_path in run_learner.sh and run_actor.sh.
The above steps seem to be fine and everything is normal.
Then I ran two terminals separately: `bash run_learner.sh` and `bash run_actor.sh`.
The `bash run_learner.sh` node crashes during runtime, and shortly afterwards `bash run_actor.sh` also crashes. I can't figure out what caused it, so I have pasted images of the output from the three terminals. In the image, the terminal in the top left is running `bash run_actor.sh`, the terminal in the upper right is running `bash run_learner.sh`, and the terminal at the bottom is running: python serl_robot_infra/robot_servers/franka_server.py --gripper_type=Robotiq --robot_ip=172.16.0.2 --gripper_ip=/dev/ttyUSB0
Thank you all for your help