jezek / xgb

The X Go Binding is a low-level API to communicate with the X server. It is modeled on XCB and supports many X extensions.
Other
130 stars 13 forks source link

Conn.NewRequest() seems to hang forever #23

Closed hhvn-uk closed 3 months ago

hhvn-uk commented 3 months ago

Hi, I wrote a program that uses xgb and xgbutil.

Sometimes it will freeze in the following circumstance:

(dlv) goroutine 1
Switched from 0 to 1 (thread 15150)
(dlv) bt
 0  0x000000000043c1ce in runtime.gopark
    at /usr/lib/go/src/runtime/proc.go:403
 1  0x000000000044d605 in runtime.selectgo
    at /usr/lib/go/src/runtime/select.go:327
 2  0x000000000053582f in github.com/jezek/xgb.(*Conn).NewRequest
    at ./go/pkg/mod/github.com/jezek/xgb@v1.1.0/xgb.go:346
 3  0x000000000053582f in github.com/jezek/xgb/xproto.PutImage
    at ./go/pkg/mod/github.com/jezek/xgb@v1.1.0/xproto/xproto.go:12539
 4  0x000000000054d213 in github.com/jezek/xgbutil/xgraphics.(*Image).xdraw
    at ./go/pkg/mod/github.com/jezek/xgbutil@v0.0.0-20230603163917-04188eb39cf0/xgraphics/xsurface.go:205
 5  0x0000000000568a32 in github.com/jezek/xgbutil/xgraphics.(*Image).XDraw
    at ./go/pkg/mod/github.com/jezek/xgbutil@v0.0.0-20230603163917-04188eb39cf0/xgraphics/xsurface.go:134
 6  0x0000000000568a32 in hhvn.uk/hbspbar/drw.(*Window).Paint
    at ./git/hbspbar/drw/x.go:108
 7  0x0000000000571fb6 in main.bar.draw
    at ./git/hbspbar/bar.go:152
 8  0x0000000000572cba in main.Redraw
    at ./git/hbspbar/bar.go:179
 9  0x0000000000572cba in main.main
    at ./git/hbspbar/main.go:94
10  0x000000000043bd9d in runtime.main
    at /usr/lib/go/src/runtime/proc.go:271
11  0x000000000046d8a1 in runtime.goexit
    at /usr/lib/go/src/runtime/asm_amd64.s:1695
(dlv) up 2
> runtime.gopark() /usr/lib/go/src/runtime/proc.go:403 (PC: 0x43c1ce)
Warning: debugging optimized function
Frame 2: ./go/pkg/mod/github.com/jezek/xgb@v1.1.0/xgb.go:346 (PC: 53582f)
   341:         seq := make(chan struct{})
   342:         select {
   343:         case c.reqChan <- &request{buf: buf, cookie: cookie, seq: seq}:
   344:                 // request is in buffer
   345:                 // wait until request is processed or connection is closed
=> 346:                 select {
   347:                 case <-seq:
   348:                         // request was successfully sent to X server
   349:                 case <-c.doneSend:
   350:                         // c.sendRequests is down, your request was not handled
   351:                 }
(dlv)

I've come back to my computer to see the program frozen for hours. (It is a status bar, so it displays the "current" time, which was whenever it froze).

I'm not sure what information I can give you about the state that xgb is in when this happens that could be helpful, but if you tell me I'm happy to find out.

hhvn-uk commented 3 months ago

Oh and the source is at hhvn.uk/hbspbar

jezek commented 3 months ago

Hi, I've skimmed through your (hbspbar), xgbutil and xgb code, but found nothing that could throw that error.

First I thought, that the problem could be, that in your code you use xgb.WaitForEvent() and ugbutil uses the function too and that's why the xproto.PutImage halts, cause the WaitForEvent in hbspbar/drv/x.go steals the event, but it seems not to be the case. But I could miss something.

From the error you provided, it seems that X is not acknowledging your request and the routine waits forever. Maybe there is some bug on the X server side (not very probable). Are you using original X11, or are you on Wayland + XWayland?

When there is no bug on the server side, I don't think xgb is at fault. Maybe xgbutil has some bug, but I don't really know, cause I don't know the internal workings of xgbutil. My main interest is the xgb package. I've just forked xgbutil and replaced the imports to jezek/xgb on someone suggestion, but I don't plan to debug or do something with it. I'm just accepting reasonable merge requests to xgbutil.

What would I suggest?

This is everything I can do for you now. Happy debugging and coding and I hope you resolve your issues. Post a comment here, what was the cause, when you find the solution.

hhvn-uk commented 3 months ago

First I thought, that the problem could be, that in your code you use xgb.WaitForEvent() and ugbutil uses the function too

Yes, I think this would be it. I have a goroutine which is running WaitForEvent() constantly in a for loop. WaitForEvent reads from a channel, so if that goroutine consumes the event, then PutImage will never be sent the event.

The xevent pkg uses WaitForEvent under the hood, and PollForEvent also reads from the channel. So a direct substitution won't help.

There are two ways I can think to solve this:

I will try implementing the mutex and see how it goes.

hhvn-uk commented 3 months ago

Actually, on further thought: the only reason I am using WaitForEvent at the moment is to see whether the connection has been closed. This is signalled by both return values being nil, which is exactly what PollForEvent will return if there are no events/errors in the queue..

I might have to tinker with xgb a bit.

hhvn-uk commented 3 months ago

Hmmm... I implemented a way to poll for events while checking if the connection has been closed.

However I then went to check what functions used WaitForEvent so I knew to lock the mutex for them.

$ grep WaitForEvent -R .
./xgbutil.go:   // Evqueue is the queue that stores the results of xgb.WaitForEvent.
./xevent/xevent.go://   ev, err := XUtilValue.Conn().WaitForEvent()
./xevent/eventloop.go:// If 'block' is True, then call 'WaitForEvent' before sucking up
./xevent/eventloop.go:          ev, err := xu.Conn().WaitForEvent()

Nothing within xgbutil I'm using actually uses it. Instead they use cookies, which sidesteps the whole problem by using a seperate reply channel

This leads me to believe the problem lies elsewhere.

hhvn-uk commented 3 months ago

Now that I am again clueless as to what is wrong, I might as well fill you in on this.

From the error you provided, it seems that X is not acknowledging your request and the routine waits forever. Maybe there is some bug on the X server side (not very probable). Are you using original X11, or are you on Wayland + XWayland?

Just a regular Xorg. I agree, I doubt it is a server bug.

hhvn-uk commented 3 months ago

Had a look at the original backtrace I posted again.

It's getting stuck in xgb.NewRequest which does in fact seem like an xgb bug. I'm going to have a look and see if I can figure out what's happening.

Sorry for bombarding you with messages, I've just been typing stuff out as I think about it.

hhvn-uk commented 3 months ago

Annoyingly I lost my broken instance of the program so I can't dlv on that anymore.

I've added a lot of printfs to the relevant xgb functions and am now waiting for the program to break again. (Which might take quite a while)

Also got the race detector turned on to see if that picks up on anything.

hhvn-uk commented 3 months ago

Okay, I had my suspicions this may be the case.

The problem is in Conn.noop:

func (c *Conn) noop() error {
    cookie := c.NewCookie(true, true)
    cookie.Sequence = c.newSequenceId()
    c.cookieChan <- cookie
    if err := c.writeBuffer(c.getInputFocusRequest()); err != nil {
        return err
    }
    cookie.Reply() // (<--- it's stuck here) wait for the buffer to clear
    return nil
}

(That's as far as my suspicion went).

I attached dlv to it to have a look, and it is stuck on a select{}:

Frame 2: ./git/hbspbar/vendor/github.com/jezek/xgb/cookie.go:98 (PC: 5d045f)
    93:         if c.errorChan == nil {
    94:                 return nil, errors.New("Cannot call 'replyChecked' on a cookie that
" +
    95:                         "is not expecting a reply or an *error*.")
    96:         }
    97:
=>  98:         select { (<--- here)
    99:         case reply := <-c.replyChan:
   100:                 return reply, nil
   101:         case err := <-c.errorChan:
   102:                 return nil, err
   103:         case <-c.conn.doneRead:

The goroutine responsible for sending on the err and reply channels is the one running readResponses, which is stuck trying to send on eventChan.

Frame 3: ./git/hbspbar/vendor/github.com/jezek/xgb/xgb.go:586 (PC: 5d4cb1)
   581:                                 if err != nil { // this is an error to a request
   582:                                         // synchronous processing
   583:                                         if cookie.errorChan != nil {
   584:                                                 cookie.errorChan <- err
   585:                                         } else { // asynchronous processing
=> 586:                                                 c.eventChan <- err (<--- here)
   587:                                                 // if this is an unchecked reply, ping the cookie too
   588:                                                 if cookie.pingChan != nil {
   589:                                                         cookie.pingChan <- true
   590:                                                 }
   591:                                         }

eventChan is read by WaitForEvent and friends. It has a buffer of 5000.

So within hbspbar, what is happening?

This suggests to me either:

The nature of this bug is really annoying as the time between my last comment in the thread and this one was the amount of time it took for the bar to freeze. As such it'll probably take ages to figure out if I've solved the problem unless I find some way to simulate the problem. I'm going to do some more poking around in dlv and see if it can offer any insights.

hhvn-uk commented 3 months ago

If this happened perhaps some requests sent to the X server never got replies, and therefore the cookie buffer got filled up. In that case the only solution I see would be to add a timeout to cookies.

Okay I think this can be ruled out (in readResponses):

// Checked requests with replies
case cookie.replyChan != nil && cookie.errorChan != nil:
    Logger.Printf("Found cookie with sequence id %d that is "+
        "expecting a reply but will never get it. Currently "+
        "on sequence number %d", cookie.Sequence, seq)
// Unchecked requests with replies
case cookie.replyChan != nil && cookie.pingChan != nil:
    Logger.Printf("Found cookie with sequence id %d that is "+
        "expecting a reply (and not an error) but will never "+
        "get it. Currently on sequence number %d",
        cookie.Sequence, seq)
// Checked requests without replies
case cookie.pingChan != nil && cookie.errorChan != nil:
    cookie.pingChan <- true
    // Unchecked requests without replies don't have any channels,
    // so we can't do anything with them except let them pass by.
}
hhvn-uk commented 3 months ago
  • Or, the main goroutine in hbspbar read less events than it generated and so the buffer filled up. In that case I would need to make it preferentially gobble up events over redrawing the window. This will be much easier to implement, so I'll go ahead and try that.

I realized that I don't actually use any of the events/errors in the main loop. All I need to know is whether or not the connect has been closed, so I will be testing this out:

diff --git a/drw/x.go b/drw/x.go
index 910262a..687dc8b 100644
--- a/drw/x.go
+++ b/drw/x.go
@@ -2,6 +2,7 @@ package drw // import "hhvn.uk/hbspbar/drw"

 import (
        "os"
+       "log"
        "image"

        "hhvn.uk/hbspbar/config"
@@ -22,7 +23,7 @@ type eventwrap struct {
 }

 var InitErr error
-var Events  chan *eventwrap
+var XClosed chan bool

 var xc      *xgb.Conn
 var xu      *xgbutil.XUtil
@@ -59,12 +60,17 @@ func init() {
                return
        }

-       Events := make(chan *eventwrap)
+       XClosed := make(chan *eventwrap)

        go func() { for {
                ev, err := xc.WaitForEvent()
-               w := eventwrap{ev, err}
-               Events <- &w
+               if ev == nil && err == nil {
+                       XClosed <- true
+                       return
+               }
+               if err != nil {
+                       log.Printf("X.WaitForEvents: %v\n", err)
+               }
        }}()
 }

diff --git a/main.go b/main.go
index bc4e0d4..f71c8ec 100644
--- a/main.go
+++ b/main.go
@@ -45,23 +45,8 @@ func main() {
                case s := <-signals:
                        common.Error("%s\n", s)
                        break mainloop
-               case e := <- drw.Events:
-                       if e.Ev == nil && e.Err == nil {
-                               common.Error("X connection clossed\n")
-                               break mainloop
-                       }
-
-                       if e.Err != nil {
-                               common.Perror("X.WaitForEvents", e.Err)
-                               continue
-                       }
-
-                       // switch e.Ev.(type) {
-                       // case xproto.DestroyNotifyEvent:
-                       //      Err <- fmt.Errorf("Window destroyed")
-                       //      return
-                       // default:
-                       // }
+               case <- drw.XClosed:
+                       break mainloop
                case err := <-bspc.EvErr:
                        common.Error("Couldn't read event: %s\n", err)
                        break mainloop

I uhh... just had a look back at that patch now I pasted it: Events := make(chan *eventwrap).

Note the :=. I believe that may have been the culprit this whole time.

sigh

Oh and I'll put in a PR to warn if the event channel has become full as that will probably help stop bugs like this in the future.

jezek commented 3 months ago

If I understand correctly, you found the bug in your code. Good work. This means xgb is not at fault and we can close this issue. Am I right?

I'll look at your PR #24 and see what can be done as soon I have some time to spare. Thanx.

PS: Just a suggestion. When I skimmed through your code, I said to my self that is is hard to read, cause of too many global variables. I used to code like this (using many global variables) in the past, but then I figured out, it comes with lot of unintentional bugs and since than I try to use global variables just for errors and constants. My code is more readable (for me) since than and less buggy. ;)

Anyway, good work and I wish happy coding with less bugs in the future.

hhvn-uk commented 3 months ago

Yeah, the mistake was on my end.

This was my first project in go and I immediately grabbed for the "ooh shiny packages" and wanted to make everything a self contained package. It probably would've made more sense the put the X stuff in a struct in retrospect.