Closed: maxwelllls closed this issue 5 months ago
Hi @maxwelllls, the procedure runtime sandbox is not intended to run loops inside a procedure. If possible, please use the "Main" panel to create a state machine and call the procedure repeatedly. You can have it "Jump" at the end of the block back to the same block name to repeat.
Following your suggestion, I switched to using the "Main" panel for loop testing. However, after about 1000 iterations, the pyri_robotics_motion_service still crashed, causing subsequent movej calls to fail.
The steps to reliably reproduce the issue are as follows: Run "run_2ur5e_sim", set two sets of joint poses with small joint angle differences, which I named "ur_ref_pos_1" and "ur_ref_pos_1_1", then repeatedly move the robotic arm back and forth between these two positions. It usually takes about 10 minutes to reach 1000 cycles and cause a crash.
Using the command `netstat -an | grep 55921`, it can be observed that the TCP server remains in the LISTEN state, but a large number of TCP connections are stuck in the CLOSE_WAIT state. This seems to indicate a problem in the server-side logic, possibly a failure to properly close connections.
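A leak like this can be tracked over time by tallying TCP states per port from `netstat -an` output. A minimal plain-Python sketch (the port 55921 and the sample lines below are taken from this report; the helper name is illustrative):

```python
import re
from collections import Counter

def count_tcp_states(netstat_output: str, port: int) -> Counter:
    """Tally TCP connection states for a given local port from `netstat -an` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local-addr foreign-addr state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            local, state = fields[3], fields[5]
            if re.search(rf"[.:]{port}$", local):
                states[state] += 1
    return states

sample = """\
tcp        0      0 127.0.0.1:55921         0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:55921         127.0.0.1:40100         CLOSE_WAIT
tcp        0      0 127.0.0.1:55921         127.0.0.1:40102         CLOSE_WAIT
"""
print(count_tcp_states(sample, 55921))
```

Running this periodically and watching whether the CLOSE_WAIT count grows monotonically distinguishes a genuine leak from a transient backlog.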
After enabling debug mode, the log file repeatedly reports IPv6 connection failures, but contains no information related to the crash. debug_log.txt
Hi @maxwelllls , sorry I haven't had a chance to take a look at this yet. I have been working on improving the documentation for my various projects. (Take a look at the new educational package: https://github.com/robotraconteur/reynard-the-robot). I will take a look at the problem you are having this week. I suspect it is a resource leak somewhere.
Hi, thank you very much for taking the time to address this issue. I have also been studying the RR source code, hoping to gain a deeper understanding and contribute to the development of this project. I appreciate your efforts and look forward to any insights you may have regarding the problem.
Hi @johnwason , I've been able to pinpoint that `ServiceSkel::CleanupGenerators()` starts cleaning up stale generators after 10 minutes, which then leads to timeouts when calling the movej function. The code uses mutexes to protect the generators, but I suspect there may be conflicts with other resources, such as ASIO.
Given the vastness of the RR framework, my understanding of the code is still quite limited. Could you provide some guidance on directions for further investigation?
@maxwelllls can you run `pip list` in your venv so I can see what packages and versions are installed?
Also can you look at the developer console on the web browser? It looks like the browser may be creating a lot of connections and closing them.
Following your instructions, I have gathered the requested information. The console messages appeared only when the webpage was first loaded and did not update when starting the Procedures or when the failure occurred. New "client closed" messages only appeared a few minutes after the Procedures failed. Attached are the results of `pip list`, screenshots of the browser page, and the full browser console output for your review.
console-export-2024-3-7_9-0-41.txt I hope this information helps with troubleshooting. Thank you for your guidance and support.
Thanks for the data
I think I have reproduced the problem. I am trying to track down the cause.
The connection errors you see in the console are actually normal behavior. There are several candidate URLs to connect to the services, and the errors you are seeing are the rejected URLs. It looks like the connections are successful.
By shortening the cleanup timeout for generators to 1 minute and increasing the call frequency of movej as much as possible (unrelated to the length of the movej path, using rapid short paths and 100% speed), the issue can be quickly reproduced. Looking forward to your findings.
I think the problem is in the class `TrajectoryMoveGenerator` in `pyri.robotics.robotics_motion_service.__main__`. The cleanup generator function is collecting this class. Something is causing a leak or possibly a concurrency problem. This class is currently implemented as a synchronous generator, where it really should be an asynchronous generator. See https://robotraconteur.github.io/robotraconteur/doc/core/latest/python/api/Generators.html . There is a newer utility in the companion library to help with this design pattern: https://github.com/robotraconteur/robotraconteur_companion_python/blob/master/src/RobotRaconteurCompanion/Util/TaskGenerator.py
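For reference, the service-side generator lifecycle described in the linked documentation (Next/Abort/Close, with StopIteration signalling completion) can be mocked in plain Python without any RR dependency. The method names below follow that documentation; the class name and waypoint payload are illustrative only:

```python
class MockTrajectoryGenerator:
    """Plain-Python mock of the Robot Raconteur generator lifecycle:
    Next() returns values and raises StopIteration when done;
    Close()/Abort() end the generator early so the service can
    release it immediately instead of waiting for a cleanup timeout."""

    def __init__(self, waypoints):
        self._waypoints = list(waypoints)
        self._index = 0
        self.closed = False

    def Next(self):
        if self.closed:
            raise StopIteration("generator closed")
        if self._index >= len(self._waypoints):
            self.closed = True       # complete: mark for immediate release
            raise StopIteration()
        value = self._waypoints[self._index]
        self._index += 1
        return value

    def Close(self):
        # Graceful early termination requested by the client
        self.closed = True

    def Abort(self):
        # Immediate termination; a real trajectory generator would also stop motion
        self.closed = True
```

The fix discussed later in this thread amounts to the service dropping its reference as soon as the generator is closed or complete, rather than relying on the periodic cleanup pass.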
It looks like objects from Toppra may be passed to the generator. If Toppra memory is being released by RR it could definitely cause a problem.
The problem was a complex interaction between locks in C++ and the Python Global Interpreter Lock (GIL). In the `CleanupGenerators()` function, the `generators_lock` is used to protect the map during modification. Previously, the generators were destroyed while the `generators_lock` was still held. Because the objects being destroyed were Python objects, the GIL had to be acquired, and this lock ordering caused deadlocks with other threads. The pull request https://github.com/robotraconteur/robotraconteur/pull/179 modifies the destruction process so that the actual deletion in Python happens after the lock is released. Interactions like this are hard to predict and can be even harder to fix.
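The pattern behind the fix — never run finalizers that need another lock (here, the GIL) while holding the container's lock — can be sketched in plain Python. The names below are illustrative, not the actual RR source:

```python
import threading

class GeneratorRegistry:
    """Sketch of the corrected cleanup order: remove expired entries from
    the map while holding the lock, but let the removed objects be
    destroyed only after the lock is released, so destructors that need
    another lock cannot deadlock against threads locking in the
    opposite order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._generators = {}

    def add(self, gen_id, gen):
        with self._lock:
            self._generators[gen_id] = gen

    def cleanup(self, expired_ids):
        to_destroy = []
        with self._lock:
            # Under the lock: only detach the expired generators
            for gen_id in expired_ids:
                gen = self._generators.pop(gen_id, None)
                if gen is not None:
                    to_destroy.append(gen)
        # Lock released: dropping the last references here runs any
        # destructors outside the critical section
        to_destroy.clear()

# Demonstration that destruction happens and the map is emptied
destroyed = []

class FakeGenerator:
    def __del__(self):
        destroyed.append(True)

reg = GeneratorRegistry()
reg.add("g1", FakeGenerator())
reg.cleanup(["g1"])
```

The same two-phase shape (mutate under lock, finalize outside it) is a common cure for lock-order deadlocks regardless of language.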
There was also an issue where completed generators were not being deleted immediately; they were only cleaned up later by the timeout. That is why the problem only showed up after a while.
Releasing the core Robot Raconteur library is a little complicated so you will need to build from source until I am able to release it.
There is also a problem where the motion execution will get stuck if you stop the procedure. This is unrelated and I will work on fixing it. For now, if you click "Enable Jog" it will clear the deadlock.
Thank you for resolving this issue. I’ve been trying to fix this problem by modifying the code all week, so building from source is no problem at all.
Currently, the code uses `CleanupGenerators()` to delete outdated generators. Could we consider explicitly calling `close()` to clean up these generators instead?
After reviewing your pull request, I noticed that `SendGeneratorResponse()` now explicitly erases the generators.
I’ve successfully run 5000 tests without issues.
This is the spec on how the generator should behave. I think there are still a few more places the lifecycle needs to be improved but I think the major errors are fixed.
During a cyclic endurance test, after running 3160 cycles, the robot stopped moving. The following error messages were output in the output window:
Attempting to run Procedures again results in the same error and displays the following message box:
The web page shows that robotics_motion is in a disconnected state:
Attempts to manually start the pyri-robotics-motion-service service fail with the message: "Exception: Identifier UUID or Name already in use":
I have checked the log information but did not find any useful error messages. pyri-core-2024-01-19--13-33-08.zip
- System version: Ubuntu 22.04 + Preempt RT (please disregard "VirtualBox"; it is actually running on a physical machine)
- Python environment: pyenv + Python 3.9.5
- Installation method: followed the instructions in the readme, installed using pip and `pyri-cli webui-install`
- Robot: Realman RML63
- Robot driver: TCP + JSON protocol communication at 100 Hz
Could you please advise on how to further investigate this issue to improve system stability? Also, I increased the sampling points to 40 as per this issue. Might this change be affecting the system? Thanks for your help!