jsk-ros-pkg / jsk_roseus

ROS EusLisp Client
http://wiki.ros.org/roseus/Tutorials
17 stars 56 forks source link

Memory faults on subscribers #668

Open Affonso-Gui opened 3 years ago

Affonso-Gui commented 3 years ago

I have been struggling with weird segfaults on my custom build for some time but I believe I have finally found a way to reproduce them in the release build:

  1. rostopic pub /msg std_msgs/String "hello" -r 100
  2. Execute the following
    
    (ros::roseus "test")

(defclass my-sub :slots (msg))

(defmethod my-sub (:init () self) (:update (_msg) (setq msg _msg)) (:get () msg))

(defun my-one-shot (topic msgtype) (let ((sub (instance my-sub :init))) (ros::subscribe topic msgtype #'send sub :update) (while (not (send sub :get)) (ros::spin-once)) (send sub :get)))

(defun foo () (let ((i 0)) (while t (if (zerop (mod (incf i) 100)) (print i)) (my-one-shot "/msg" std_msgs::String))))

(foo)



This will execute normally at first but eventually cause a segmentation fault if ran for long enough. 
Testing with the  1.7.4-1bionic.20210505.033302 and  1.7.4-1bionic.20200930.061900 builds several times the segmentation fault occurred around 900, 2400 and 8600 iterations, depending on each run.

#### What I think the problem is
The subscribers registered in this case look something like `(#<compiled-code #X5610a667fe40 send> #<my-sub #X5610a6cdb6c8> :update)` and are attributed to symbols in the lisp package to avoid being garbage collected.
However, I believe that the object instances within the registered and supposedly guarded list are actually succeptible to being garbage collected, leading to the observed segmentation faults.
I have also verified that the errors always occur just after a garbage collection.

@708yamaguchi Could you see if you can recreate the situation in your machine?
708yamaguchi commented 3 years ago

Yes, the problem is reproduced on my machine as well. My environment is as follows:

$ apt show ros-melodic-roseus
Package: ros-melodic-roseus
Version: 1.7.4-1bionic.20210505.033302
...

In my environment, I called the (foo) function 10 times while restarting roseus, all of which printed 900 and caused a segmentation fault.

1.irteusgl$ (foo)
100
200
300
400
500
600
700
800
900
Call Stack (max depth: 20):
  0: at (ros::spin-once)
  1: at (ros::spin-once)
  2: at (while (not (send sub :get)) (ros::spin-once))
  3: at (let ((sub (instance my-sub :init))) (ros::subscribe topic msgtype #'send sub :update) (while (not (send sub :get)) (ros::spin-once)) (send sub :get))
  4: at (my-one-shot "/msg" std_msgs::string)
  5: at (while t (if (zerop (mod (incf i) 100)) (print i)) (my-one-shot "/msg" std_msgs::string))
  6: at (let ((i 0)) (while t (if (zerop (mod (incf i) 100)) (print i)) (my-one-shot "/msg" std_msgs::string)))
  7: at (foo)
  8: at #<compiled-code #X560a24deee18>
;p=pointer?(0x560a24d2c078)
;; Segmentation Fault.
;; in /opt/ros/melodic/share/euslisp/jskeus/eus/Linux64/bin/irteusgl 0 error: cannot find method euserror
;; You are still in a signal handler.
;;Try reset or throw to upper level as soon as possible.
;; code=-1472911824 x=a8352500 addr=0
Fatal: 
YoheiKakiuchi commented 3 years ago

It is a workaround, I think. 'gensym' may occure GC. So, we should push args(and other objects) to stack. Then, pop it.

diff --git a/roseus/roseus.cpp b/roseus/roseus.cpp
index f0b4a19..b27ca20 100644
--- a/roseus/roseus.cpp
+++ b/roseus/roseus.cpp
@@ -369,8 +369,10 @@ public:
       ROS_ERROR("subscription callback function install error");
     }
     // avoid gc
+    vpush(args);
     pointer p=gensym(ctx);
     setval(ctx,intern(ctx,(char*)(p->c.sym.pname->c.str.chars),strlen((char*)(p->c.sym.pname->c.str.chars)),lisppkg),cons(ctx,scb,args));
+    vpop(); // pop agrs
   }
   ~EuslispSubscriptionCallbackHelper() {
       ROS_ERROR("subscription gc");
Affonso-Gui commented 3 years ago

@YoheiKakiuchi I have previously attempted a similar approach in https://github.com/jsk-ros-pkg/jsk_roseus/pull/636 and was kind of uncertain if it made any sense, but I guess that it actually is a valid solution for this error then! With these changes I was able to run 40,000+ iterations without any problem.

YoheiKakiuchi commented 3 years ago

Here is the correct place to fix this problem.

@@ -870,11 +872,12 @@ pointer ROSEUS_SUBSCRIBE(register context *ctx,int n,pointer *argv)
   fncallback = argv[2];
   args=NIL;
   for (int i=n-1;i>=3;i--) args=cons(ctx,argv[i],args);
-
+  vpush(args);
   EuslispMessage msg(message);
    boost::shared_ptr<SubscriptionCallbackHelper> *callback =
      (new boost::shared_ptr<SubscriptionCallbackHelper>
       (new EuslispSubscriptionCallbackHelper(fncallback, args, message)));
+  vpop();
   SubscribeOptions so(topicname, queuesize, msg.__getMD5Sum(), msg.__getDataType());
   so.helper = *callback;
   Subscriber subscriber = lnode->subscribe(so);

'args' is created on source of C-lang, so 'args' is not protected anywhere. Then 'args' may be collected as a garbege. ( objects in argv, such as 'sub' in this case, are not collected ) fncallback may also be protected because it is arguments of function.

Affonso-Gui commented 3 years ago

This fix also seems stable!

I see, so it is not exactly the object instance being collected but the list as a whole, right? I'll do some more testing tomorrow and (re)organize a PR if everything looks fine.

Also any insights if we should dump the lisp symbol on unsubscriptions? Registering on subscriptions and never unregistering them does seem like a memory leak...

YoheiKakiuchi commented 3 years ago

Yes, list(cons) object is collected. objects in list are safe.

Can we delete symbol ? I think reuse of symbol is realistic. Repeating subscribe many times is non sense, so it's not really a problem.