Robotics_motion Service Abnormal Disconnection

maxwelllls commented 7 months ago

During a cyclic endurance test, after running 3160 cycles, the robot stopped moving. The following error messages were output in the output window:

cycle count: 3158 cycle count: 3159 cycle count: 3160 Execute procedure "hello" failed:

Traceback (most recent call last): File "/home/boomy/venv/lib/python3.9/site-packages/pyri/sandbox/pyri_sandbox.py", line 290, in Next self._locself._procedure_name File "", line 15, in hello File "/home/boomy/venv/lib/python3.9/site-packages/pyri/robotics/sandbox_functions.py", line 96, in robot_movej move_gen = motion_service.movej(robot_name, joint_pos_degree, float(speed_perc)) File "/home/boomy/venv/lib/python3.9/site-packages/RobotRaconteur/RobotRaconteurPythonUtil.py", line 244, in return lambda self, args: stub_functioncallgenerator(stub, m1.Name, m1, args) File "/home/boomy/venv/lib/python3.9/site-packages/RobotRaconteur/RobotRaconteurPythonUtil.py", line 457, in stub_functioncallgenerator return GeneratorClient(stub.GeneratorFunctionCall(name, m), type1.ReturnType, param_type, stub, stub.RRGetNode()) File "/home/boomy/venv/lib/python3.9/site-packages/RobotRaconteur/RobotRaconteurPython.py", line 6639, in GeneratorFunctionCall return _RobotRaconteurPython.WrappedServiceStub_GeneratorFunctionCall(self, FunctionName, args) RobotRaconteur.RobotRaconteurPythonError.RequestTimeoutException: RobotRaconteur.RequestTimeout Request timeout

Attempting to run Procedures again results in the same error and displays the following message box: 1705713773866

The web page shows that robotics_motion is in a disconnected state:

Attempts to manually start the pyri-robotics-motion-service service fail with the message: "Exception: Identifier UUID or Name already in use":

I have checked the log information but did not find any useful error messages. pyri-core-2024-01-19--13-33-08.zip

System Version: Ubuntu 22.04 + Preempt RT (please disregard 'VirtualBox' as it is actually running on a physical machine). Python Environment: pyenv + Python 3.9.5 Installation Method: Followed the instructions in the readme, installed using pip and pyri-cli webui-install Robot: Realman RML63 Robot Driver: TCP + JSON protocol communication at 100Hz

Could you please advise on how to further investigate this issue to improve system stability? Also, I increased the sampling points to 40 as per this issue. Might this change be affecting the system? Thanks for your help!

johnwason commented 7 months ago

Hi @maxwelllls, the procedure runtime sandbox is not intended to have the cycle looped within the produce. If possible, please use the "Main" panel to create a state machine and call the produce repeatedly. You can have it "Jump" at the end of the block to the same block name to repeat.

https://github.com/pyri-project/pyri-core/blob/master/doc/user_programming_guide.md#main-state-machine

maxwelllls commented 6 months ago

Following your suggestion, I switched to using the "Main" panel for loop testing. However, after about 1000 iterations, the pyri_robotics_motion_service still crashed, leading to the failure of movej command calls.

The steps to reliably reproduce the issue are as follows: Run "run_2ur5e_sim", set two sets of joint poses with small joint angle differences, which I named "ur_ref_pos_1" and "ur_ref_pos_1_1", then repeatedly move the robotic arm back and forth between these two positions. It usually takes about 10 minutes to reach 1000 cycles and cause a crash.

maxwelllls commented 6 months ago

Using the command netstat -an | grep 55921 it can be observed that the TCP server remains in the LISTEN state, but a large number of TCP connections are stuck in the CLOSE_WAIT state. This seems to indicate an issue with the server-side program logic, possibly failing to properly handle the logic for closing connections.

maxwelllls commented 6 months ago

After enabling debug mode, the log files obtained continuously indicate failures in IPv6 connections, but there is no information related to the crash. debug_log.txt

johnwason commented 6 months ago

Hi @maxwelllls , sorry I haven't had a chance to take a look at this yet. I have been working on improving the documentation for my various projects. (Take a look at the new educational package: https://github.com/robotraconteur/reynard-the-robot). I will take a look at the problem you are having this week. I suspect it is a resource leak somewhere.

maxwelllls commented 6 months ago

Hi, thank you very much for taking the time to address this issue. I have also been studying the RR source code, hoping to gain a deeper understanding and contribute to the development of this project. I appreciate your efforts and look forward to any insights you may have regarding the problem.

maxwelllls commented 5 months ago

Hi @johnwason , I've been able to pinpoint that void ServiceSkel::CleanupGenerators() starts to clean up outdated Generators after 10 minutes, then leads to timeouts when calling the movej function. The code uses mutexes to protect generators, but I suspect there might be conflicts with other resources, such as ASIO.

Given the vastness of the RR framework, my understanding of the code is still quite limited. Could you provide some guidance on directions for further investigation?

johnwason commented 5 months ago

@maxwelllls can you run pip list on your venv so I can see what packages and versions are installed?

johnwason commented 5 months ago

Also can you look at the developer console on the web browser? It looks like the browser may be creating a lot of connections and closing them.

maxwelllls commented 5 months ago

Following your instructions, I have provided the requested information. The console messages appeared only upon accessing the webpage and did not update when starting or encountering issues with the Procedures. It was a few minutes after the failure of running Procedures that new messages indicating "client closed" appeared. Attached are the results of the pip list command, screenshots of the browser page, and the full browser console output for your review.

pip list.txt

console-export-2024-3-7_9-0-41.txt I hope this information helps with troubleshooting. Thank you for your guidance and support.

johnwason commented 5 months ago

Thanks for the data

I think I have reproduced the problem. I am trying to track down the cause.

The connection errors you see in the console are actually normal behavior. There are several candidate URLs to connect to the services, and the errors you are seeing are the rejected URLs. It looks like the connections are successful.

maxwelllls commented 5 months ago

By shortening the cleanup timeout for generators to 1 minute and increasing the call frequency of movej as much as possible (unrelated to the length of the movej path, using rapid short paths and 100% speed), the issue can be quickly reproduced. Looking forward to your findings.

johnwason commented 5 months ago

I think the problem is in the class TrajectoryMoveGenerator in pyri.robotics.robotics_motion_service.__main__. The cleanup generator function is collecting this class. Something is causing a leak or possibly a concurrency problem. This class is currently implemented as a synchronous generator, where it really should be an asynchronous generator. See https://robotraconteur.github.io/robotraconteur/doc/core/latest/python/api/Generators.html . There is a newer utility in the companion library to help with this design pattern: https://github.com/robotraconteur/robotraconteur_companion_python/blob/master/src/RobotRaconteurCompanion/Util/TaskGenerator.py

johnwason commented 5 months ago

It looks like objects from Toppra may be passed to the generator. If Toppra memory is being released by RR it could definitely cause a problem.

johnwason commented 5 months ago

The problem was a complex interaction between locks in C++ and the Python Global Interpreter Lock (GIL). In the CleanupGenerators() function, In the function, the generators_lock is used to protect the map during modification. Before, the generators were being destroyed when the generators_lock was still held. Because the objects being destroyed were from Python, the GIL had to be locked. Because of the order of locking, this was causing deadlocks with other threads. The pull request https://github.com/robotraconteur/robotraconteur/pull/179 modifies the destruction process so that the actual deletion in Python happens after the lock is released. This is a complex interaction that is harder to predict and can be even harder to fix.

There was also an issue where the generators were not being deleted immediately once they were complete rather than later with a timeout. That is why the issue was only showing up after a while.

Releasing the core Robot Raconteur library is a little complicated so you will need to build from source until I am able to release it.

There is also a problem where the motion execution will get stuck if you stop the procedure. This is unrelated and I will work on fixing it. For now, if you click "Enable Jog" it will clear the deadlock.

maxwelllls commented 5 months ago

Thank you for resolving this issue. I’ve been trying to fix this problem by modifying the code all week, so building from source is no problem at all.

Currently, the code uses CleanupGenerators()to delete outdated Generators. Could we consider explicitly calling close() to clean up these Generators?

maxwelllls commented 5 months ago

After reviewing your pull request, I noticed that SendGeneratorResponse() now explicitly erase the generators.

I’ve successfully run 5000 tests without issues.

johnwason commented 5 months ago

https://github.com/robotraconteur/robotraconteur_standards/blob/master/draft/object_protocol.md#generatornext

This is the spec on how the generator should behave. I think there are still a few more places the lifecycle needs to be improved but I think the major errors are fixed.

pyri-project / pyri-robotics

Robotics_motion Service Abnormal Disconnection #4