Open tkmtnt7000 opened 3 years ago
僕も一緒にデバッグしていたのですが、「resetに効果がありそう」というところまでしか分からず、根本的な解決理由が分かりませんでした。 どなたか教えてくださると勉強になり助かります。
Ok, this was actually quite interesting to take a look at, and I think I have figured it out.
The problem revolves around the fact that the one-shot-subscribe
is doing some pretty risky stuff to simulate closures in eus.
https://github.com/jsk-ros-pkg/jsk_roseus/blob/master/roseus/euslisp/roseus-utils.l#L1117-L1141
Particularly, there are three aspects creating this bug:
one-shot-subscribe
checks if there are any other active subscribers with the ros::get-num-publishers
(yes, the function name is misleading), and proceeds to using the existing subscriber if any is found.((lambda-closure nil 93867875589976 0 (msg) (setq lmsg msg)))
. And what happens here is that when entering another repl the new dynamic environment being introduced kind of shades the previous one, so the message is properly received and handled, but the value is assigned to a different value of lmsg
and therefore the loop never exits.Try the following:
There are a few possible solutions to this, but I would just reject cases in which ros::get-num-publishers
is non-nil altogether.
EDIT: it really is kind of nonsense to use any previous subscriber registered and just hope that they will set the lmsg
for you.
(I would also like to add some self-advertising and say that with a proper exception handling framework we could for example ensure that the topic has been unsubscribed before proceeding to the next repl)
Thank you very much for the explanation. I understand the root of the problem a little better now.
As I understand it, the function #'(lambda (msg) (setq lmsg msg))
itself is registered in the subscriber, which is causing the problem.
In addition to the problems you reported, I found the following problem.
After the following steps,
one-shot-subscribe-> C-c -> reset -> (let (a) (one-shot-subscribe ...)) -> (topic is published, but loop never exits) -> C-c -> reset
lmsg
is binded like below
5.irteusgl$ lmsg
#<std_msgs::string #X55fad4cf26a8>
Excuse me, but I have an additional question: I would like to know the difference between the following two cases. I was expecting Bug to occur in both cases, but in fact it did not occur in the case above. What is the effect of (let (a) ...) ?
Here is an annotated example. Remember that closures are executed from their own lexical environment, and not from the one active at invocation
;;;; Create a "pointer-closure"
eus$ (let (a) (defvar fn #'(lambda () (setq a 1))))
;; fn
eus$ fn
;; (lambda-closure nil 94458722936800 0 nil (setq a 1))
;;;; Observe that any occurence of `(let (a) ...)' will lead to the same pointer location
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)
eus$ (let (a) (sys:list-all-bindings))
;; ((a))
;;;; Enter a new repl
eus$ (let (a) (error "this"))
;; eus 0 error: this in (error "this")
;;;; The lexical environment of the time of the error is preserved
;;;; (which is what allows for dynamic-eval debugging)
E1-eus$ (sys:list-all-bindings)
;; ((a))
;;;; Now I'm not sure why, but in this new repl the allocation for `(let (a) ...)' goes to a different pointer
E1-eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722937456 0 nil nil)
E1-eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722937456 0 nil nil)
E1-eus$ (let (a) (sys:list-all-bindings))
;; ((a) (a))
;;;; So when we invoke the function the value is set on the second environment
;;;; But since value references point to the first value it always remains null
E1-eus$ (let (a) (funcall fn) (sys:list-all-bindings))
;; ((a) (a . 1))
;;;; Reset to the original repl
E1-eus$ reset
;;;; There are no active environments and frame allocation goes to the same pointer as before
eus$ (sys:list-all-bindings)
;; nil
eus$ (let (a) #'(lambda () nil))
;; (lambda-closure nil 94458722936800 0 nil nil)
;;;; So if we recreate the `(let (a) ...)' environment and invoke the function everything goes well
;;;; (this is why calling one-shot-subscribe here works even with an outdated callback)
eus$ (let (a) (funcall fn) (sys:list-all-bindings))
;; ((a . 1))
;;;; I am not sure how this works as well, but pointer allocation also seems to work differently for nested frames
;;;; Note that the pointer for the same active `(let (a) ...)' environment is different
eus$ (let (b) (let (a) #'(lambda () nil)))
;; (lambda-closure nil 94458722936856 0 nil nil)
;;;; So what happens here is that the original frame at 94458722936800 is not active anymore,
;;;; and since the symbol `a' cannot be located at any dynamic frame the global value is set instead.
;;;; (for all I know the process could easily segfault at this time, but probably we have some fail safe code identifying that the frame is not active and reallocating to a global environment?)
eus$ (let (b) (let (a) (funcall fn) (sys:list-all-bindings)))
;; ((a) (b))
eus$ a
;; 1
@Affonso-Gui Thank you very much for the repeated explanations.
After talking to @pazeshun, I learned that similar problems have been reported before. https://github.com/euslisp/EusLisp/issues/147#issuecomment-153003343
We have also discovered a more bad bug behavior. We believe this is fatal because the bug occurs without using C-c
.
(one-shot-subscribe ... :unsubscribe nil) ;; topic received -> (let (a) (one-shot-subscribe ...) ;; Bug (repl do not return from one-shot-subscribe)
Here is a summary of the current problems.
1. (one-shot-subscribe ...) -> C-c -> reset -> (let (a) (one-shot-subscribe ...)) ;; Bug
2. (one-shot-subscribe ... :unsubscribe nil) -> (let (a) (one-shot-subscribe ...)) ;; Bug
3. (one-shot-subscribe ...) -> C-c -> (one-shot-subscribe ...) ;; Bug
In response to these problems, @YoheiKakiuchi gave me the following advices.
(unwind-protect)
so that (ros::unsubscribe) will be called even when C-c -> reset.(lambda-closure nil ...)
instead of #'(lambda ...)
for the callback function of (ros::subscribe ...)
'(lambda ...)
instead of #'(lambda ...)
for the callback function of (ros::subscribe ...)
. However, '(lambda ...)
is not a LAMCLOSURE and an error occurs, so I gave up.
https://github.com/jsk-ros-pkg/jsk_roseus/blob/b214284e24ad6287bf17545b9a73c541190fac1a/roseus/roseus.cpp#L360-L370Currently, I think
I am working on the following branch to modify one-shot-subscribe. I will create PR if these solutions seems good. https://github.com/708yamaguchi/jsk_roseus/tree/one-shot-subscribe-error
Some more problems:
4. (ros::subscribe topic ...) -> (one-shot-subscribe topic) -> Bug
5. one-shot-subscribe -> C-c -> reset -> (ros::spin-once) -> Endless loop if a message is received
6. one-shot-subscribe -> C-c -> reset -> (let (a) (ros::spin-once)) -> Segmentation Fault if a message is received
A few drawbacks to your solution:
unwind-protect is only called after reset, so it doesn't cover Problem 3, and can also cause problems if the user defines another subscriber in the new repl. one-shot-subscribe -> C-c -> continue programming -> eventually call ros::subscribe on the same topic -> when resetting the user defined subscriber will be gone (probably without the user realizing it)
The advantage of using lambda-closure
is that you can force the frame pointer to zero even within let blocks (which means global scope and is not what we want here). There is no difference if you are combining the pointer from another lambda; just notice how you are creating a form identical to the (lambda-closure nil 94458722936800 0 nil (setq a 1))
observed above.
My solution for problems 1-4:
lmsg
in the active frame. unsubscribe
key argument; the user is always welcome to use a proper ros::subscribe
call if they want to keep the subscriber active.My solution for problems 5-6:
(defclass my-sub :slots (topic msgtype msg))
(defmethod my-sub (:init (_topic _msgtype) (setq topic _topic) (setq msgtype _msgtype) self) (:update (_msg) (setq msg _msg)) (:get () msg))
(defun my-one-shot (topic msgtype) (let ((sub (instance my-sub :init topic msgtype))) (ros::subscribe topic msgtype #'send sub :update) (while (not (send sub :get)) (ros::spin-once)) (send sub :get)))
The subscriber registered by this code is `(#<compiled-code #X5610a667fe40 send> #<my-sub #X5610a6cdb6c8> :update)`, which does not include any frame pointers and therefore can be safely invoked from any scope.
EDIT: Yes, I completely oversaw how topic and msgtype are not actually required by the helper class...
Thank you for your advice!
I understand the drawbacks of (unwind-protect)
and (lambda-closure ...)
.
I commited your solutions in https://github.com/jsk-ros-pkg/jsk_roseus/pull/667
The first commit is solution 1 and the second commit is solution 2.
Could you review this PR if you have time?
My concern is that by removing the :unsubscribe
argument, backward compatibility will be lost.
Excuse me, but I have another question. Here is your proposed problem.
5. one-shot-subscribe -> C-c -> reset -> (ros::spin-once) -> Endless loop if a message is received
6. one-shot-subscribe -> C-c -> reset -> (let (a) (ros::spin-once)) -> Segmentation Fault if a message is received
I think problem 5. is similar to the following case.
1.eus$ (let (a) (defvar fn #'(lambda () (setq a 1))))
fn
2.eus$ fn
(lambda-closure nil 94381871208416 0 nil (setq a 1))
3.eus$ funcall fn
;; Endless loop
Then how can I reproduce problem 6.?
I'll check that PR right away.
Yes, problem 6 is hard to reproduce because (let (b) (funcall fn))
will just set a
in the global scope without any errors.
Remember how I said that this is totally segfault-able but there should be some kind of fail safe mechanism rearranging it to the global scope? I would say that somehow this mechanism is not operating for subscription callbacks.
After revisiting this issue I actually have some corrections on the annotated example:
;;;; Now I'm not sure why, but in this new repl the allocation for `(let (a) ...)' goes to a different pointer ;;;; I am not sure how this works as well, but pointer allocation also seems to work differently for nested frames
What happens here is that the new bind frame is allocated on top of the vsp, meaning that the more stacks we have the bigger the address will be. The address depends on the order of the frames, and not on the variables.
(let (a)
(print (third #'(lambda ())))
(let (b)
(print (third #'(lambda ())))
(let (c)
(print (third #'(lambda ()))))))
94786309311104
94786309311160
94786309311216
(let (a)
(print (- (third #'(lambda ())) 94786309311104))
(let (b)
(print (- (third #'(lambda ())) 94786309311104))
(let (c)
(print (- (third #'(lambda ())) 94786309311104))
(let (d)
(print (- (third #'(lambda ())) 94786309311104))))))
0
56
112
168
;;;; So what happens here is that the original frame at 94458722936800 is not active anymore, ;;;; and since the symbol `a' cannot be located at any dynamic frame the global value is set instead. eus$ (let (b) (let (a) (funcall fn) (sys:list-all-bindings))) ;; ((a) (b)) eus$ a ;; 1
It is not that the original frame is not active anymore, but rather that the address is now used by a different frame, in this case the let (b)
one. And because there is no variable a
binding in this frame it is assigned to the global value. Something like:
The reverse order works as expected:
eus$ (let (a) (let (b) (funcall fn) (sys:list-all-bindings)))
;; ((b) (a . 1))
;;;; (for all I know the process could easily segfault at this time, but probably we have some fail safe code identifying that the frame is not active and reallocating to a global environment?)
There is a fail safe structure that forces global framing (https://github.com/euslisp/EusLisp/blob/master/lisp/c/eval.c#L1586-L1587), but it is not the case for this example.
Fetchで音声認識をさせているときにone-shot-subscribeが返ってこないことがありました。かんたんなケースで再現させたのが以下になります。
one-shot-subscribeを呼んでトピックを待っている最中にCtrl-cで止めてもう一度one-shot-subscribeを起動するとsubscribeされない。 Ctrl-cの後にresetを入れると正しくsubscribeされる。
resetにはどのような効果があるのか知りたいです。 @708yamaguchi
再現コード
ターミナル1
ターミナル2