rail-berkeley / serl

SERL: A Software Suite for Sample-Efficient Robotic Reinforcement Learning
https://serl-robot.github.io/
MIT License
330 stars 32 forks

A positional error occurred while running "bash run_learner.sh" #66

Closed iu777 closed 3 months ago

iu777 commented 3 months ago

Hello everyone, I am deploying SERL in the real world and have fully followed the instructions on the webpage for hardware and software deployment. I first want to reproduce task 1: peg insertion. Here is my procedure; please check whether there are any errors. Thank you:

I followed the instructions on the webpage, opened a terminal, and ran: "python serl_robot_infra/robot_servers/franka_server.py --gripper_type=Robotiq --robot_ip=172.16.0.2 --gripper_ip=/dev/ttyUSB0"

Following the instructions, I updated TARGET_POSE in peg_env/config.py with the measured end-effector pose, then recorded 20 demo trajectories with the spacemouse, and edited demo_path and checkpoint_path in run_learner.sh and run_actor.sh.
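(The edit to peg_env/config.py is roughly of the form sketched below; the numbers are placeholders, not the actual measured pose, which has to be read off the real robot.)

```python
import numpy as np

# Placeholder values -- replace with the end-effector pose measured on the
# real robot: xyz position in meters followed by roll/pitch/yaw in radians.
TARGET_POSE = np.array([0.58, 0.05, 0.10, np.pi, 0.0, 0.0])
```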

The above steps seemed fine and everything went normally.

Then I ran the two scripts in separate terminals: "bash run_learner.sh" and "bash run_actor.sh".

The "bash run_learner. sh" node crashes during runtime, and shortly after,"bash run_actor.sh" also crashes. I can't find out what caused it. I will paste the information images displayed on the three terminals.

The terminal in the top left of the image is running "bash run_actor.sh", the terminal in the top right is running "bash run_learner.sh", and the terminal below is running: python serl_robot_infra/robot_servers/franka_server.py --gripper_type=Robotiq --robot_ip=172.16.0.2 --gripper_ip=/dev/ttyUSB0

Thank you all for your help


jianlanluo commented 3 months ago

hey, can you paste your crash message on your learner?

FYI @youliangtan @charlesxu0124 it looks like the agentlace timeout issue is still here on the actor side, we should plan to fix that permanently.

iu777 commented 3 months ago

Of course, I can paste the full output of the "bash run_learner.sh" terminal below for your reference. Thank you:

(serl) aml@aml-NUC11PHi7:~/serl-main/examples/async_peg_insert_drq$ bash run_learner.sh
2024-07-01 16:12:52.652413: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
I0701 16:12:54.338428 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
I0701 16:12:54.338671 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
I0701 16:12:54.338775 140355677083456 _schedule.py:74] A polynomial schedule was set with a non-positive `transition_steps` value; this results in a constant schedule with value `init_value`.
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
The ResNet-10 weights already exist at '/home/aml/.serl/resnet10_params.pkl'.
Loaded 5.418792M parameters from ResNet-10 pretrained on ImageNet-1K
replaced conv_init in pretrained_encoder
replaced norm_init in pretrained_encoder
replaced ResNetBlock_0 in pretrained_encoder
replaced ResNetBlock_1 in pretrained_encoder
replaced ResNetBlock_2 in pretrained_encoder
replaced ResNetBlock_3 in pretrained_encoder
entity: null
exp_descriptor: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097
experiment_id: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
group: null
project: serl_dev
tag: serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097
unique_identifier: '20240701_161301'

wandb: Currently logged in as: 345915750 (zpb345915750). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.17.2
wandb: Run data is saved locally in /tmp/tmpy80et43_/wandb/run-20240701_161304-serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: ⭐️ View project at https://wandb.ai/zpb345915750/serl_dev
wandb: 🚀 View run at https://wandb.ai/zpb345915750/serl_dev/runs/serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
demo buffer size: 576
 starting learner loop
Filling up replay buffer: 204it [01:09,  2.95it/s]                                                    
 sent initial network to actor
learner:   0%|                                                            | 0/1000000 [00:00<?, ?it/s]/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/core/lift.py:305: RuntimeWarning: kwargs are not supported in vmap, so "train" is(are) ignored
  warnings.warn(msg.format(name, ', '.join(kwargs.keys())), RuntimeWarning)
I0701 16:16:54.329327 140355677083456 checkpoints.py:574] Saving checkpoint at step: 0
I0701 16:16:54.329805 140355677083456 checkpoints.py:662] Using Orbax as backend to save Flax checkpoints. For potential troubleshooting see: https://flax.readthedocs.io/en/latest/guides/training_techniques/use_checkpointing.html#orbax-as-backend-troubleshooting
W0701 16:16:54.331048 140355677083456 type_handlers.py:222] SaveArgs.aggregate is deprecated, please use custom TypeHandler (https://orbax.readthedocs.io/en/latest/custom_handlers.html#typehandler) or contact Orbax team to migrate before August 1st, 2024.
I0701 16:16:54.335811 140355677083456 checkpointer.py:157] Saving checkpoint to /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20/checkpoint_0.
learner:   0%|                                                           | 0/1000000 [02:29<?, ?it/s] 
Traceback (most recent call last):
  File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 415, in <module>
    app.run(main)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 394, in main
    learner(
  File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 313, in learner
    checkpoints.save_checkpoint(
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/flax/training/checkpoints.py", line 697, in save_checkpoint
    orbax_checkpointer.save(
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/orbax/checkpoint/checkpointer.py", line 164, in save
    raise ValueError(f'Destination {directory} already exists.')
ValueError: Destination /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20/checkpoint_0 already exists.
wandb: - 0.033 MB of 0.033 MB uploaded
wandb: Run history:
wandb:             actor/actor_loss ▁
wandb:                actor/entropy ▁
wandb:            actor/temperature ▁
wandb:                     actor_lr ▁
wandb:           critic/critic_loss ▁
wandb:          critic/predicted_qs ▁
wandb:             critic/target_qs ▁
wandb:                    critic_lr ▁
wandb: temperature/temperature_loss ▁
wandb:               temperature_lr ▁
wandb:         timer/sample_actions ▁
wandb:   timer/sample_replay_buffer ▁
wandb:               timer/step_env ▁
wandb:                  timer/total ▁
wandb:                  timer/train ▁
wandb:          timer/train_critics ▁
wandb: 
wandb: Run summary:
wandb:             actor/actor_loss 0.26329
wandb:                actor/entropy -6.81702
wandb:            actor/temperature 0.01
wandb:                     actor_lr 0.0003
wandb:           critic/critic_loss 1.82864
wandb:          critic/predicted_qs -0.2719
wandb:             critic/target_qs -1.03865
wandb:                    critic_lr 0.0003
wandb: temperature/temperature_loss -0.03989
wandb:               temperature_lr 0.0003
wandb:         timer/sample_actions 0.05981
wandb:   timer/sample_replay_buffer 0.05218
wandb:               timer/step_env 0.1329
wandb:                  timer/total 0.20114
wandb:                  timer/train 59.3275
wandb:          timer/train_critics 30.11736
wandb: 
wandb: 🚀 View run serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301 at: https://wandb.ai/zpb345915750/serl_dev/runs/serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301
wandb: ⭐️ View project at: https://wandb.ai/zpb345915750/serl_dev
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/tmpy80et43_/wandb/run-20240701_161304-serl_dev_drq_rlpd10demos_peg_insert_random_resnet_097_20240701_161301/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
Exception in thread Thread-4 (run):
Traceback (most recent call last):
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/aml/agentlace-main/agentlace/zmq_wrapper/req_rep.py", line 49, in run
    res = self.impl_callback(message)
  File "/home/aml/agentlace-main/agentlace/trainer.py", line 117, in __callback_impl
    return request_callback(_type, _payload) if request_callback else {}
  File "/home/aml/serl-main/examples/async_peg_insert_drq/async_drq_randomized.py", line 237, in stats_callback
    wandb_logger.log(payload, step=update_steps)
  File "/home/aml/serl-main/serl_launcher/serl_launcher/common/wandb.py", line 94, in log
    wandb.log(data, step=step)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 449, in wrapper
    return func(self, *args, **kwargs)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 400, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 390, in wrapper
    return func(self, *args, **kwargs)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1877, in log
    self._log(data=data, step=step, commit=commit)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1641, in _log
    self._partial_history_callback(data, step, commit)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 1513, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 618, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
    self._publish(rec)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/aml/anaconda3/envs/serl/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

iu777 commented 3 months ago

Please help me figure out what is causing this problem and how to solve it. If additional information is needed, please let me know. Thank you. @jianlanluo

charlesxu0124 commented 3 months ago

Hi @iu777, thanks for trying SERL. Your steps for reproducing the peg insertion task look correct to me. I think the error is due to the checkpoint save path /home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20 already existing, while the checkpoint-saving code in the learner is non-overwriting by default. This causes the learner to crash first, and then the actor hangs because the learner node is no longer available to receive data.
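For context, the learner saves checkpoints through flax's checkpoints.save_checkpoint, which refuses to write into an existing destination unless told otherwise. A minimal sketch of the kind of call involved (not the exact code in async_drq_randomized.py; the paths and state here are hypothetical stand-ins):

```python
from flax.training import checkpoints

# Hypothetical stand-ins for the learner's actual flags and train state.
checkpoint_path = "/home/aml/serl-main/examples/async_peg_insert_drq/peg_insert_20"
train_state = {"params": {}}  # the real call passes the agent's train state
step = 0

# save_checkpoint defaults to overwrite=False; with the Orbax backend this
# raises "Destination ... already exists" when checkpoint_0 is already on disk.
checkpoints.save_checkpoint(
    ckpt_dir=checkpoint_path,
    target=train_state,
    step=step,
    keep=100,
    overwrite=False,  # set True only if clobbering old checkpoints is intended
)
```

So either delete the old checkpoint directory or point checkpoint_path at a fresh one before restarting the learner.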

jianlanluo commented 3 months ago

Just delete that ckpt and retry

iu777 commented 3 months ago

Thanks, @jianlanluo @charlesxu0124

After deleting the checkpoint folder and the demo trajectory file from the previous training, I re-taught the target pose, recorded 20 spacemouse demo trajectories again, and then ran the actor and learner nodes. The robot is currently training normally.

But I have a question: the progress bars in both the actor and learner terminals advance slowly, and after half an hour of training my peg task has not managed a single successful insertion. My Franka also makes contact with quite a strong impact force. Which step might I have set up incorrectly, and how can I fix it? I would appreciate a detailed answer, as I am still new to machine learning. Thank you very much!

(In the image below, the terminal on the left is the learner node; the "5555" and "6666" outputs are prints I added to observe the program, please ignore them. The terminal in the upper right is the actor node.)


charlesxu0124 commented 3 months ago

Looking at your learner speed, it looks like you are running the model on the CPU. Also make sure you have all the functions JIT'ed as in our repo. Try reinstalling the whole conda environment with jax[cuda] using the updated instructions on the main branch; it's least finicky to clean up and reinstall the whole environment from scratch rather than trying to reinstall jax by itself. FYI, we get around 4 it/s on an RTX 4090.
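A quick way to confirm whether jax actually sees the GPU (an illustrative check, not part of the SERL scripts):

```python
import jax

# On a CPU-only install this prints something like [CpuDevice(id=0)];
# with a working jax[cuda] install it should list a CUDA/GPU device instead.
print(jax.devices())
print(jax.default_backend())  # expect 'gpu' when the CUDA install is correct
```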

jianlanluo commented 3 months ago

Also make sure you read and understand the paper so that you can set the reference limiting correctly; this will reduce the impact force.
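(For intuition: reference limiting clips the commanded pose so it never strays far from the current end-effector pose; since an impedance controller's contact force grows with that reference error, bounding the error bounds the force. A rough, hypothetical sketch of the idea, not the actual SERL implementation:)

```python
import numpy as np

def limit_reference(current_pos: np.ndarray, target_pos: np.ndarray,
                    clip: float = 0.01) -> np.ndarray:
    """Clip the reference position to within `clip` meters of the current pose,
    so the impedance controller's restoring force stays bounded on contact."""
    delta = np.clip(target_pos - current_pos, -clip, clip)
    return current_pos + delta
```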

iu777 commented 3 months ago

Thank you. I will recheck the environment following your suggestion and verify whether the program is training on the CPU instead of the GPU.

In the current training I have run into a problem: during training, my Franka arm, which was originally in a good vertical posture, twisted into a joint singularity and got stuck in mid-air, unable to continue the peg insertion training. Franka stayed stuck in mid-air for a long time, but the program was not interrupted. Is this phenomenon normal? How can I avoid or handle it? Thank you. An image of the twisted Franka is below. @jianlanluo @charlesxu0124

[image: Franka arm twisted into a joint singularity]

charlesxu0124 commented 3 months ago

Yes, the joint singularity is a common issue with the Franka arm after repeated motion. Here are a few tips to alleviate the problem:

Hope this helps