jsk-ros-pkg / jsk_roseus

ROS EusLisp Client
http://wiki.ros.org/roseus/Tutorials
17 stars 56 forks source link

roseusでresetしないとone-shot-subscribeが使えない #666

Open tkmtnt7000 opened 3 years ago

tkmtnt7000 commented 3 years ago

Fetchで音声認識をさせているときにone-shot-subscribeが返ってこないことがありました。かんたんなケースで再現させたのが以下になります。

one-shot-subscribeを呼んでトピックを待っている最中にCtrl-cで止めてもう一度one-shot-subscribeを起動するとsubscribeされない。 Ctrl-cの後にresetを入れると正しくsubscribeされる。

resetにはどのような効果があるのか知りたいです。 @708yamaguchi

再現コード

eustf roseus_c_util 1.irteusgl$ (ros::roseus "hoge")
t
2.irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; subscribeできている
#<std_msgs::string #X561d81801580>
3.irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; subscribeできている
#<std_msgs::string #X561d817f3f58>
4.irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; subscribeできている
#<std_msgs::string #X561d81915dd8>
5.irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; ここでCtrl-cをした
  C-c C-cinterrupt6.B1-irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; subscribeしても返ってこない
  C-c C-cinterrupt7.B2-irteusgl$ reset
8.irteusgl$ (one-shot-subscribe "/chatter" std_msgs::String) ;; subscribeできるようになった
#<std_msgs::string #X561d816aca70>

ターミナル1

$ roscore

ターミナル2

$ rostopic pub /chatter std_msgs/String "data: 'hello'" -1
708yamaguchi commented 3 years ago

僕も一緒にデバッグしていたのですが、「resetに効果がありそう」というところまでしか分からず、根本的な解決理由が分かりませんでした。 どなたか教えてくださると勉強になり助かります。

Affonso-Gui commented 3 years ago

Ok, this was actually quite interesting to take a look at, and I think I have figured it out.

The problem revolves around the fact that the one-shot-subscribe is doing some pretty risky stuff to simulate closures in eus. https://github.com/jsk-ros-pkg/jsk_roseus/blob/master/roseus/euslisp/roseus-utils.l#L1117-L1141

Particularly, there are three aspects creating this bug:

  1. When the code is interrupted by C-c, the function exits to a new repl without unsubscribing the topic.
  2. To avoid overwriting subscribers (roseus only supports one per topic), one-shot-subscribe checks if there are any other active subscribers with the ros::get-num-publishers (yes, the function name is misleading), and proceeds to using the existing subscriber if any is found.
  3. This is the tricky part. When registering a subscriber from within a let frame you are basically adding the pointer value of the bind frame to a global object: ((lambda-closure nil 93867875589976 0 (msg) (setq lmsg msg))). And what happens here is that when entering another repl the new dynamic environment being introduced kind of shades the previous one, so the message is properly received and handled, but the value is assigned to a different value of lmsg and therefore the loop never exits.

Try the following:

There are a few possible solutions to this, but I would just reject cases in which ros::get-num-publishers is non-nil altogether. EDIT: it really is kind of nonsense to use any previous subscriber registered and just hope that they will set the lmsg for you.

Affonso-Gui commented 3 years ago

(I would also like to add some self-advertising and say that with a proper exception handling framework we could for example ensure that the topic has been unsubscribed before proceeding to the next repl)

708yamaguchi commented 3 years ago

Thank you very much for the explanation. I understand the root of the problem a little better now.

As I understand it, the function #'(lambda (msg) (setq lmsg msg)) itself is registered in the subscriber, which is causing the problem. In addition to the problems you reported, I found the following problem. After the following steps,

one-shot-subscribe-> C-c -> reset -> (let (a) (one-shot-subscribe ...)) -> (topic is published, but loop never exits) -> C-c -> reset

lmsg is binded like below

5.irteusgl$ lmsg
#<std_msgs::string #X55fad4cf26a8>

Excuse me, but I have an additional question: I would like to know the difference between the following two cases. I was expecting Bug to occur in both cases, but in fact it did not occur in the case above. What is the effect of (let (a) ...) ?

Affonso-Gui commented 3 years ago

Here is an annotated example. Remember that closures are executed from their own lexical environment, and not from the one active at invocation

;;;; Create a "pointer-closure"
eus$ (let (a) (defvar fn #'(lambda () (setq a 1))))
;; fn
eus$ fn
;; (lambda-closure nil 94458722936800 0 nil (setq a 1))

;;;; Observe that any occurence of `(let (a) ...)' will lead to the same pointer location
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)
eus$ (let (a) (sys:list-all-bindings))
;; ((a))

;;;; Enter a new repl
eus$ (let (a) (error "this"))
;; eus 0 error: this in (error "this")

;;;; The lexical environment of the time of the error is preserved
;;;; (which is what allows for dynamic-eval debugging)
E1-eus$ (sys:list-all-bindings)
;; ((a))

;;;; Now I'm not sure why, but in this new repl the allocation for `(let (a) ...)' goes to a different pointer
E1-eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722937456 0 nil nil)
E1-eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722937456 0 nil nil)
E1-eus$ (let (a) (sys:list-all-bindings))
;; ((a) (a))

;;;; So when we invoke the function the value is set on the second environment
;;;; But since value references point to the first value it always remains null
E1-eus$ (let (a) (funcall fn) (sys:list-all-bindings))
;; ((a) (a . 1))

;;;; Reset to the original repl
E1-eus$ reset

;;;; There are no active environments and frame allocation goes to the same pointer as before
eus$ (sys:list-all-bindings)
;; nil
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)

;;;; So if we recreate the `(let (a) ...)' environment and invoke the function everything goes well
;;;; (this is why calling one-shot-subscribe here works even with an outdated callback)
eus$ (let (a) (funcall fn) (sys:list-all-bindings))
;; ((a . 1))

;;;; I am not sure how this works as well, but pointer allocation also seems to work differently for nested frames
;;;; Note that the pointer for the same active `(let (a) ...)' environment is different
eus$ (let (b) (let (a) #'(lambda () nil)))
;; (lambda-closure nil 94458722936856 0 nil nil)

;;;; So what happens here is that the original frame at 94458722936800 is not active anymore,
;;;; and since the symbol `a' cannot be located at any dynamic frame the global value is set instead.
;;;; (for all I know the process could easily segfault at this time, but probably we have some fail safe code identifying that the frame is not active and reallocating to a global environment?)
eus$ (let (b) (let (a) (funcall fn) (sys:list-all-bindings)))
;; ((a) (b))
eus$ a
;; 1
708yamaguchi commented 3 years ago

@Affonso-Gui Thank you very much for the repeated explanations.

After talking to @pazeshun, I learned that similar problems have been reported before. https://github.com/euslisp/EusLisp/issues/147#issuecomment-153003343

We have also discovered a more bad bug behavior. We believe this is fatal because the bug occurs without using C-c.

(one-shot-subscribe ... :unsubscribe nil) ;; topic received -> (let (a) (one-shot-subscribe ...) ;; Bug (repl do not return from one-shot-subscribe)

Here is a summary of the current problems.

1. (one-shot-subscribe ...) -> C-c -> reset -> (let (a) (one-shot-subscribe ...)) ;; Bug
2. (one-shot-subscribe ... :unsubscribe nil) -> (let (a) (one-shot-subscribe ...)) ;; Bug
3. (one-shot-subscribe ...) -> C-c -> (one-shot-subscribe ...) ;; Bug

In response to these problems, @YoheiKakiuchi gave me the following advices.

  1. Use (unwind-protect) so that (ros::unsubscribe) will be called even when C-c -> reset.
  2. Use (lambda-closure nil ...) instead of #'(lambda ...) for the callback function of (ros::subscribe ...)
  3. Use '(lambda ...) instead of #'(lambda ...) for the callback function of (ros::subscribe ...). However, '(lambda ...) is not a LAMCLOSURE and an error occurs, so I gave up. https://github.com/jsk-ros-pkg/jsk_roseus/blob/b214284e24ad6287bf17545b9a73c541190fac1a/roseus/roseus.cpp#L360-L370

Currently, I think

I am working on the following branch to modify one-shot-subscribe. I will create PR if these solutions seems good. https://github.com/708yamaguchi/jsk_roseus/tree/one-shot-subscribe-error

Affonso-Gui commented 3 years ago

Some more problems:

4. (ros::subscribe topic ...) -> (one-shot-subscribe topic) -> Bug
5. one-shot-subscribe -> C-c -> reset -> (ros::spin-once) -> Endless loop if a message is received
6. one-shot-subscribe -> C-c -> reset -> (let (a) (ros::spin-once)) -> Segmentation Fault if a message is received

A few drawbacks to your solution:

  1. unwind-protect is only called after reset, so it doesn't cover Problem 3, and can also cause problems if the user defines another subscriber in the new repl. one-shot-subscribe -> C-c -> continue programming -> eventually call ros::subscribe on the same topic -> when resetting the user defined subscriber will be gone (probably without the user realizing it)

  2. The advantage of using lambda-closure is that you can force the frame pointer to zero even within let blocks (which means global scope and is not what we want here). There is no difference if you are combining the pointer from another lambda; just notice how you are creating a form identical to the (lambda-closure nil 94458722936800 0 nil (setq a 1)) observed above.

My solution for problems 1-4:

  1. If a subscriber is already registered print a warning and either raise an error or forcefully overwrite it. As I said before the subscriber required here is highly localized so it makes no sense to use any previously installed callback and just hope that it will set lmsg in the active frame.
  2. Remove the unsubscribe key argument; the user is always welcome to use a proper ros::subscribe call if they want to keep the subscriber active.

My solution for problems 5-6:

  1. Use "class closures" instead of "pointer closures". Simplified example:
    
    (defclass my-sub :slots (topic msgtype msg))

(defmethod my-sub (:init (_topic _msgtype) (setq topic _topic) (setq msgtype _msgtype) self) (:update (_msg) (setq msg _msg)) (:get () msg))

(defun my-one-shot (topic msgtype) (let ((sub (instance my-sub :init topic msgtype))) (ros::subscribe topic msgtype #'send sub :update) (while (not (send sub :get)) (ros::spin-once)) (send sub :get)))



The subscriber registered by this code is `(#<compiled-code #X5610a667fe40 send> #<my-sub #X5610a6cdb6c8> :update)`, which does not include any frame pointers and therefore can be safely invoked from any scope.

EDIT: Yes, I completely oversaw how topic and msgtype are not actually required by the helper class...
708yamaguchi commented 3 years ago

Thank you for your advice! I understand the drawbacks of (unwind-protect) and (lambda-closure ...).

I commited your solutions in https://github.com/jsk-ros-pkg/jsk_roseus/pull/667 The first commit is solution 1 and the second commit is solution 2. Could you review this PR if you have time? My concern is that by removing the :unsubscribe argument, backward compatibility will be lost.

Excuse me, but I have another question. Here is your proposed problem.

5. one-shot-subscribe -> C-c -> reset -> (ros::spin-once) -> Endless loop if a message is received
6. one-shot-subscribe -> C-c -> reset -> (let (a) (ros::spin-once)) -> Segmentation Fault if a message is received

I think problem 5. is similar to the following case.

1.eus$ (let (a) (defvar fn #'(lambda () (setq a 1))))
fn
2.eus$ fn
(lambda-closure nil 94381871208416 0 nil (setq a 1))
3.eus$ funcall fn
;; Endless loop

Then how can I reproduce problem 6.?

Affonso-Gui commented 3 years ago

I'll check that PR right away.

Yes, problem 6 is hard to reproduce because (let (b) (funcall fn)) will just set a in the global scope without any errors. Remember how I said that this is totally segfault-able but there should be some kind of fail safe mechanism rearranging it to the global scope? I would say that somehow this mechanism is not operating for subscription callbacks.

Affonso-Gui commented 3 years ago

After revisiting this issue I actually have some corrections on the annotated example:

;;;; Now I'm not sure why, but in this new repl the allocation for `(let (a) ...)' goes to a different pointer ;;;; I am not sure how this works as well, but pointer allocation also seems to work differently for nested frames

What happens here is that the new bind frame is allocated on top of the vsp, meaning that the more stacks we have the bigger the address will be. The address depends on the order of the frames, and not on the variables.

(let (a) 
       (print (third #'(lambda ())))
       (let (b)
         (print (third #'(lambda ())))
         (let (c)
           (print (third #'(lambda ()))))))

94786309311104
94786309311160
94786309311216

(let (a) 
       (print (- (third #'(lambda ())) 94786309311104))
       (let (b)
         (print (- (third #'(lambda ())) 94786309311104))
         (let (c)
           (print (- (third #'(lambda ())) 94786309311104))
           (let (d)
             (print (- (third #'(lambda ())) 94786309311104))))))

0
56
112
168

;;;; So what happens here is that the original frame at 94458722936800 is not active anymore, ;;;; and since the symbol `a' cannot be located at any dynamic frame the global value is set instead. eus$ (let (b) (let (a) (funcall fn) (sys:list-all-bindings))) ;; ((a) (b)) eus$ a ;; 1

It is not that the original frame is not active anymore, but rather that the address is now used by a different frame, in this case the let (b) one. And because there is no variable a binding in this frame it is assigned to the global value. Something like:

closure

The reverse order works as expected:

eus$ (let (a) (let (b) (funcall fn) (sys:list-all-bindings)))
;; ((b) (a . 1))

;;;; (for all I know the process could easily segfault at this time, but probably we have some fail safe code identifying that the frame is not active and reallocating to a global environment?)

There is a fail safe structure that forces global framing (https://github.com/euslisp/EusLisp/blob/master/lisp/c/eval.c#L1586-L1587), but it is not the case for this example.