Open wyatt-troia opened 3 years ago
Sorry, I cannot reproduce this as is. Can you provide more detail? What are the directory contents? Is this inside a container? A mounted FS? What's the OS?
Same error on macOS Big Sur 11.1. It works without error in Docker (for Mac).
In my environment, adding the -directory foobar option solved that error.
Maybe it is just "too many open files".
I ran into this same issue (macOS Big Sur 11.2.1). I'm not sure exactly what the cause is, but I traced it down to a function called register() in kqueue.go, which calls unix.Kevent, which fails. On my Mac, this function is defined in syscall_bsd.go, which calls kevent() in a generated file named zsyscall_darwin_amd64.go. The bug appears to be way deep inside the OS, and it's beyond me to begin debugging the root cause.
Happily, I discovered that this call hierarchy is only invoked when using the NotifyWatcher. If I use the PollingWatcher instead (CompileDaemon --polling=true ...), the error goes away and all works as expected. Maybe worth adding to the README as a known issue.
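For readers wondering why polling sidesteps the kevent call path: a polling watcher only needs to compare file metadata between passes, so nothing has to be registered with the kernel. The following is a minimal sketch of that idea, not CompileDaemon's actual PollingWatcher; the names snapshot and changed are made up for illustration.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// snapshot records the modification time (UnixNano) of every regular file
// below root. Directories are read during the walk, but nothing is kept
// open between passes.
func snapshot(root string) (map[string]int64, error) {
	seen := make(map[string]int64)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		seen[path] = info.ModTime().UnixNano()
		return nil
	})
	return seen, err
}

// changed lists paths that were added, removed, or modified between two
// snapshots, in sorted order.
func changed(before, after map[string]int64) []string {
	var out []string
	for path, mtime := range after {
		if old, ok := before[path]; !ok || old != mtime {
			out = append(out, path)
		}
	}
	for path := range before {
		if _, ok := after[path]; !ok {
			out = append(out, path)
		}
	}
	sort.Strings(out)
	return out
}

// mustSnapshot exits the program if the walk fails.
func mustSnapshot(root string) map[string]int64 {
	s, err := snapshot(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	return s
}

func main() {
	before := mustSnapshot(".")
	// A real poller would sleep here and loop; two passes keep the sketch short.
	after := mustSnapshot(".")
	fmt.Println("changed paths:", changed(before, after))
}
```

The trade-off is the usual one: polling costs a full directory walk per interval, while a kqueue-based watcher is event-driven but needs kernel resources for what it watches.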
> I ran into this same issue (macOS Big Sur 11.2.1). I'm not sure exactly what the cause is, but I traced it down to a function called register() in kqueue.go, which calls unix.Kevent, which fails. On my Mac, this function is defined in syscall_bsd.go, which calls kevent() in a generated file named zsyscall_darwin_amd64.go. The bug appears to be way deep inside the OS, and it's beyond me to begin debugging the root cause.
Good job tracing the issue back to the syscall level. Without specifics it is hard to reason about what is going wrong, but kqueue fails if the file descriptor it is supposed to watch is invalid. Whatever that means. I tried looking up FreeBSD's kqueue implementation as it is likely to be similar, but there was no obvious place where the EBADF is coming from in this case. If I had to guess, it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?
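As a side note on where the "bad file descriptor" wording comes from: it is the string Go prints for the EBADF errno, which the kernel returns when a syscall is handed a descriptor that is not open. A small sketch, assuming a Unix-like system and using fstat(2) rather than kevent(2) only because it is easy to call from the standard library; statFD and isBadFD are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// statFD calls fstat(2) on fd and returns the raw error, if any.
func statFD(fd int) error {
	var st syscall.Stat_t
	return syscall.Fstat(fd, &st)
}

// isBadFD reports whether err is (or wraps) EBADF.
func isBadFD(err error) bool {
	return errors.Is(err, syscall.EBADF)
}

func main() {
	// -1 is never a valid descriptor, so the kernel rejects it with EBADF.
	err := statFD(-1)
	fmt.Println(err, isBadFD(err))
}
```

This does not explain why the watcher ends up with an invalid descriptor; it only shows that the message in the report corresponds to EBADF and how one might test for it in code.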
> Happily, I discovered that this call hierarchy is only invoked when using the NotifyWatcher. If I use the PollingWatcher instead (CompileDaemon --polling=true ...), the error goes away and all works as expected. Maybe worth adding to the README as a known issue.
Thanks for the suggestion, there is already a section in the README about Mac OS X + polling. Is there something missing?
> Good job tracing the issue back to the syscall level. Without specifics it is hard to reason about what is going wrong, but kqueue fails if the file descriptor it is supposed to watch is invalid. Whatever that means.
Lol, that was my thinking as well. "Whatever that means".
> If I had to guess, it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?
I wish I could be of more help, but I wasn't able to figure anything out in this regard. I did try e.g. --exclude-dir=".git", as well as some other dirs; however, none of them solved the issue. I didn't go through that approach with much rigor, though, and doing so could lead to further evidence (i.e. excluding all directories in my repo and then adding them back one by one until I trigger the bug). I wish I could point you to the repo that this came up for me in so that you could investigate further (assuming you have access to a macOS system); however, it's been made private by the company I work for (obligatory check us out at https://goteleport.com!). No promises about the timeline, but I will add this to my running todo list to do it myself. Alternately or in concert, perhaps @wyatt-troia or @ypresto have an open source repo to point you at that you can use to repro.
> Thanks for the suggestion, there is already a section in the README about Mac OS X + polling. Is there something missing?
Ah, indeed there is. Nope no suggestions, I just need to read the README more carefully next time.
> > If I had to guess, it is either a weird file type or has something to do with the underlying filesystem. Any info in this direction?
> I wish I could be of more help, but I wasn't able to figure anything out in this regard. I did try e.g. --exclude-dir=".git", as well as some other dirs; however, none of them solved the issue. I didn't go through that approach with much rigor, though, and doing so could lead to further evidence (i.e. excluding all directories in my repo and then adding them back one by one until I trigger the bug).
I understand :) Just a quick check: there were no special files like FIFOs, symlinks or unusually big files involved and no special file systems (like for example, running inside a container or a VM)?
> Just a quick check: there were no special files like FIFOs, symlinks or unusually big files involved and no special file systems (like for example, running inside a container or a VM)?
There are no FIFOs and the largest individual file is 32K. One directory has symlinks but excluding it does not fix the bug.
I have an update for you, having just spent some time this morning investigating this. The approach I took initially was: for each directory in the repo, navigate into that directory and run a CompileDaemon command.
This was fruitful in that I identified several directories (.git, foo/, bar/, baz/, and foobar/) which exited with an error, the first four with watcher.Addfiles():filepath.Walk(): fw.add(path): bad file descriptor and the last with watcher.Addfiles():filepath.Walk(): open github.com/aws/aws-sdk-go/aws/ec2metadata: too many open files.
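One way to check whether the "too many open files" variant is plausible for a given directory is to compare the number of entries below it with the process's descriptor limit. The sketch below assumes, as the fsnotify documentation describes for its kqueue backend, that roughly one descriptor is needed per watched directory and per file inside it; whether CompileDaemon hits exactly that pattern here is an assumption. countEntries, softFDLimit, and exceedsLimit are illustrative names, and the code is Unix-only.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

// countEntries walks root and counts directories (including root itself)
// and non-directory entries.
func countEntries(root string) (dirs, files int, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			dirs++
		} else {
			files++
		}
		return nil
	})
	return dirs, files, err
}

// softFDLimit returns the soft RLIMIT_NOFILE of the current process.
// Go 1.19 and later raise this limit at startup, so the value can be
// higher than what `ulimit -n` prints in the shell.
func softFDLimit() (uint64, error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, err
	}
	return uint64(rl.Cur), nil
}

// exceedsLimit reports whether needed descriptors would not fit in limit.
func exceedsLimit(needed int, limit uint64) bool {
	return uint64(needed) > limit
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	dirs, files, err := countEntries(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	limit, err := softFDLimit()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d dirs, %d files, soft descriptor limit %d\n", dirs, files, limit)
	if exceedsLimit(dirs+files, limit) {
		fmt.Println("watching every entry would exceed the limit")
	}
}
```

Running it against each of the failing directories (for example foobar/) would show whether the entry count is anywhere near the limit.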
As a sanity check I next ran the CompileDaemon command from the top level of my repo with a -exclude-dir option for each of the identified directories and... I got another watcher.Addfiles():filepath.Walk(): fw.add(path): bad file descriptor. Well, I figured, there are some additional files and dot files in my top level directory, so perhaps it's one of those causing the issue. I moved all of those out of the repo into a temporary folder elsewhere, ran the command again and... same result. At this point I was a bit stumped, and ran some sanity checks to confirm that my -exclude-dir options were formatted properly and working (they were), and thus moved on to an even more meticulous approach.
I took all of the files and directories out of the repo, and then began adding them back in one by one, running CompileDaemon each time to check whether I got one of the errors of interest. This led to another breakthrough: I identified a new directory qux/ which, when added back, caused the same "bad file descriptor" error.
The interesting thing here is, if I navigate into qux/ and run the CompileDaemon command, I don't get this error! CompileDaemon runs as expected! (It gives me another error since qux/ is not a Go directory and so the build command doesn't work, but I'm presuming that would only happen after the section of code that's throwing these errors has run successfully). Huh? So it seems to me that the program attempting to read the directory itself is what's causing the error.
As far as I can tell, there is nothing out of the ordinary about this directory. It is a perfectly normal directory of regular size with the same attributes as the other directories currently in the repo that aren't causing this issue:
$ ls -li
total 0
848305 drwxr-xr-x 6 ibeckermayer staff 192 Apr 15 15:05 wibble
848321 drwxr-xr-x 5 ibeckermayer staff 160 May 8 09:28 wobble
848329 drwxr-xr-x 11 ibeckermayer staff 352 May 8 09:28 qux
848744 drwxr-xr-x 22 ibeckermayer staff 704 May 8 09:28 wubble
At this point I'm thoroughly stumped, seeking suggestions of what I might try next.
Thank you so much for the debugging effort, this is quality information!
It might of course be that there is something special going on for that directory (or your file system/disk is broken :)) but for the sake of our sanity let's go through some more favorable theories/questions:
- Is there a ulimit in place that triggers a check which fails and as a consequence the file descriptor is deemed 'bad'?
- Perhaps qux is not important and any other sufficiently 'large' (in terms of number of files contained or possibly the depth of subdirectories) directory might suffice, such as the .git directory for example. Can you simply swap qux with .git and the same error pops out?
- ... qux simultaneously?
- What makes qux special (if it really is)? Does stat qux show something wildly different than for another 'working' directory?
Thanks again for putting time and effort into this :)
I successfully ran CompileDaemon a month ago when I downloaded it, but now whenever I try to run it I get:
I've tried deleting the src and bin files for CompileDaemon and running go get "github.com/githubnemo/CompileDaemon", but it didn't change anything. I'm on a 2017 MacBook Pro.