keybase / keybase-issues

A single repo for managing publicly recognized issues with the keybase client, installer, and website.
902 stars 37 forks source link

keybase id fails to find keybased.sock on OS X #3277

Open rptb1 opened 6 years ago

rptb1 commented 6 years ago
rb@silverbird$ keybase id rptb1

We couldn't connect but there is a Keybase service (keybase.service) running (743).

You might try restarting it: keybase launchd restart keybase.service

▶ ERROR There were multiple errors: dial unix /Volumes/Silverbird-HD/Local/Users/rb/Library/Group Containers/keybase/Library/Caches/Keybase/keybased.sock: connect: invalid argument; dial unix /Volumes/Silverbird-HD/Local/Users/rb/Library/Caches/Keybase/keybased.sock: connect: no such file or directory

Note, relaunching the service does not help. So I did some poking around then tried symbolic links from to the Group Containers directory from the other Caches directory, like this:

rb@silverbird$ pwd
/Volumes/Silverbird-HD/Local/Users/rb/Library/Caches/Keybase
rb@silverbird$ ls -l
total 32
-rw-r--r--  1 rb  staff   89 12 Sep 13:05 kbfs.info
lrwxr-xr-x  1 rb  staff  104 12 Sep 13:24 kbfsd.sock -> /Volumes/Silverbird-HD/Local/Users/rb/Library/Group Containers/keybase/Library/Caches/Keybase/kbfsd.sock
-rw-r--r--  1 rb  staff   94 12 Sep 13:18 keybased.info
lrwxr-xr-x  1 rb  staff  107 12 Sep 13:24 keybased.sock -> /Volumes/Silverbird-HD/Local/Users/rb/Library/Group Containers/keybase/Library/Caches/Keybase/keybased.sock

This allowed id to complete.

rb@silverbird$ keybase id rptb1
▶ INFO Identifying rptb1
✔ "rptb1" on reddit: https://www.reddit.com/r/KeybaseProofs/comments/9f10l8/my_keybase_proof_redditrptb1_keybaserptb1/
✔ "rptb1" on github: https://gist.github.com/e67c8b07c74b34cd11330f733f347a78
✔ "rptb1" on twitter: https://twitter.com/rptb1/status/1039608884128821249
✔ "richard.brooksby" on facebook: https://www.facebook.com/richard.brooksby/posts/10102993592483830

However, it did not help with https://github.com/keybase/client/issues/5399#issuecomment-420280044

There seems to be some directory confusion going on. Has someone assumed something about paths to home directories?

The log id after the symlink hack above is acb314d287d41384527ce11c

I'm running OS X 10.11.6.

maxtaco commented 6 years ago

This is something we've never seen. Someone can hopefully take a look this sprint. It's likely something particular to your configuration.

rptb1 commented 6 years ago

Thanks. I'm been able to log in and provision, but the command line and app don't agree about whether I'm logged in. screen shot 2018-09-12 at 13 53 20

maxtaco commented 6 years ago

Likely the electron app is looking for your socket in the wrong place.

rptb1 commented 6 years ago

This might be relevant. My home directory is not in the standard place in /Users, because I'm managing files between an SSD and HDD. screen shot 2018-09-12 at 13 56 32 I have this link just in case software makes a poor assumption:

rb@silverbird$ ls -l /Users/rb
lrwxr-xr-x  1 root  admin  37 21 Dec  2017 /Users/rb -> /Volumes/Silverbird-HD/Local/Users/rb
rb@silverbird$ echo $HOME
/Volumes/Silverbird-HD/Local/Users/rb
rptb1 commented 6 years ago

I'm a new user and not in urgent need of Keybase, so there's no need to treat my request as a huge priority. However, my (small) company is evaluating it for internal comms.

maxtaco commented 6 years ago

Almost certainly that explains the issue. I doubt we've tested this.

Did this issue just start with the latest release btw?

rptb1 commented 6 years ago

I only started trying Keybase last week. I noticed one update. However, this issue was present before the update.

rb@silverbird$ keybase --version
keybase version 2.6.0-20180911163448+cbaeeb8001
maxtaco commented 6 years ago

Ok, got it. Maybe try this? Quit out of keybase, and then do:

launchctl setenv HOME /Volumes/Silverbird-HD/Local/Users/rb

And then try launching again?

rptb1 commented 6 years ago

No effect.

rb@silverbird$ rm *.sock
rb@silverbird$ launchctl setenv HOME "$HOME"
rb@silverbird$ launchctl getenv HOME
/Volumes/Silverbird-HD/Local/Users/rb

New log with this change dfaa9d0ee33f806bd4ff701c. But for now:

rb@silverbird$ launchctl unsetenv HOME
rptb1 commented 6 years ago

Note, HOME is already set correctly by launchctl. See below.

rb@silverbird$ launchctl submit -l dump-env -o /tmp/1 -e /tmp/2 -- env
rb@silverbird$ cat /tmp/1
TMPDIR=/var/folders/18/c1s1h6z90b9579wdl6d8zwg80000gn/T/
SHELL=/bin/bash
HOME=/Volumes/Silverbird-HD/Local/Users/rb
Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.WmXKTaVCFZ/Render
SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.A3EFVpWnbq/Listeners
LOGNAME=rb
DISPLAY=/private/tmp/com.apple.launchd.T0Y7AqLGoq/org.macosforge.xquartz:0
PATH=/Volumes/Silverbird-HD/Local/Users/rb/bin:/Volumes/Silverbird-HD/Local/Users/rb/opt/ec2-api-tools/bin:/opt/local/bin:/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
XPC_SERVICE_NAME=dump-env
USER=rb
XPC_FLAGS=0x0
P4CONFIG=.p4config
maxtaco commented 6 years ago

ok. Cc @keybase/react-hackers on finding where your home dir is.

rptb1 commented 6 years ago

The whole home directory thing might be a red herring -- but it's a red herring with a smoking gun! :)

mmaxim commented 6 years ago

So if you kill those two symlinks the command line app doesn't work? What happens if you type echo $HOME on the command line? Thanks!

rptb1 commented 6 years ago

See https://github.com/keybase/keybase-issues/issues/3277#issuecomment-420638436. HOME matches what it is set to in the System Preferences (and in the system directories). Note that I've run with my home directory set this way for about six years without problems.

The command line app can't find the socket. Some things seem to work with --standalone. For example:

rb@silverbird$ keybase status

Keybase isn't running. To start, you can run:

    open /Applications/Keybase.app

▶ ERROR There were multiple errors: dial unix /Volumes/Silverbird-HD/Local/Users/rb/Library/Group Containers/keybase/Library/Caches/Keybase/keybased.sock: connect: invalid argument; dial unix /Volumes/Silverbird-HD/Local/Users/rb/Library/Caches/Keybase/keybased.sock: connect: no such file or directory
rb@silverbird$ keybase --standalone status
Username:      rptb1
Logged in:     yes

Device:
    name:      Silverbird
    ID:        fdb6ba2c004aed933cd9cc5b70771018
    status:    active
...
rptb1 commented 6 years ago

It appears to be a confusion between ~/Library/Caches/Keybase and ~/Library/Group Containers/keybase/Library/Keybase. I've skimmed the source code looking for the origin of these paths. I'm trying to track down the function of the "Group Containers" directory to see if it comes from the system library somewhere.

rptb1 commented 6 years ago

It appears to be to do with sandboxing. See containerURL(forSecurityApplicationGroupIdentifier:) . Perhaps some parts of the system are being sandboxed (such as the application) and some parts are not.

mmaxim commented 6 years ago

So I was able to repro this by using an external disk, so something about hat is causing the problem. Just setting the home directory works fine.

mmaxim commented 6 years ago

Actually nvm, it did work, but took forever to start for some reason.

rptb1 commented 6 years ago

For what it's worth, /Volumes/Silverbird-HD is a partition on the second internal disk (on the SATA bus), but it's not the system volume. This probably doesn't make a difference, but sometimes the Mac makes a distinction between "internal" and "external". The system volume (mounted at "/") is an SSD. It might be that OS X is doing some extra sandboxing because code is not on the system disk -- just speculating.

rptb1 commented 6 years ago

A fresh account on the same machine with HOME in /Users can successfully run Keybase. See log ca7f51e14bb218003d5ccf1c.

rptb1 commented 6 years ago

Moving that account's home directory to /Volumes/Silverbird-HD/Local/Users produces a different error!

Logged out, then

rb@silverbird$ sudo rsync -avi /Users/keybase-test/ /Volumes/Silverbird-HD/Local/Users/keybase-test/

Then update the account HOME in the Users and Groups control panel. Logged back in, Keybase app failed to get past "Loading...". Command line did this:

silverbird:keybase-test keybase-test$ echo $HOME
/Volumes/Silverbird-HD/Local/Users/keybase-test
silverbird:keybase-test keybase-test$ keybase id rptb1
▶ WARNING | Changing current working directory because path for dialing is too long
▶ WARNING | Changing current working directory because path for dialing is too long
▶ INFO Identifying rptb1
✔ "rptb1" on github failed: API network error: Get https://gist.githubusercontent.com/rptb1/e67c8b07c74b34cd11330f733f347a78/raw/f5eb6055f3cef1dc69200301ec1fbed0c4e11653/keybase.md: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "rptb1" on twitter failed: API network error: Get https://mobile.twitter.com/rptb1/status/1039608884128821249: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "richard.brooksby" on facebook failed: API network error: Get https://www.facebook.com/richard.brooksby/posts/10102993592483830: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "rptb1" on reddit failed: API network error: Get https://www.reddit.com/r/KeybaseProofs/comments/9f10l8/my_keybase_proof_redditrptb1_keybaserptb1/.json: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)

This "path too long" error might be a clue.

The log id for this is 66b68da2ba6673557b9e621c

rptb1 commented 6 years ago

Moving the account's home directory to a long path on the system volume (under "/") also breaks things in the same way. This suggests that it's not to do with crossing filesystems or the disk hardware.

silverbird:keybase-test keybase-test$ echo $HOME
/This/path/is/too/long/for/your/feeble/software
silverbird:keybase-test keybase-test$ keybase id rptb1
▶ WARNING | Changing current working directory because path for dialing is too long
▶ WARNING | Changing current working directory because path for dialing is too long
▶ INFO Identifying rptb1
✔ "rptb1" on reddit failed: API network error: Get https://www.reddit.com/r/KeybaseProofs/comments/9f10l8/my_keybase_proof_redditrptb1_keybaserptb1/.json: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "rptb1" on twitter failed: API network error: Get https://mobile.twitter.com/rptb1/status/1039608884128821249: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "rptb1" on github failed: API network error: Get https://gist.githubusercontent.com/rptb1/e67c8b07c74b34cd11330f733f347a78/raw/f5eb6055f3cef1dc69200301ec1fbed0c4e11653/keybase.md: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)
✔ "richard.brooksby" on facebook failed: API network error: Get https://www.facebook.com/richard.brooksby/posts/10102993592483830: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (code=170)

The log id for this is d3b7aec44f735fb9377e5f1c.

Please excuse my sense of humour. Something is too feeble to cope with a long path, but it's probably not the Keybase code.

rptb1 commented 6 years ago

Possible duplicate of issue https://github.com/keybase/client/issues/9039#issuecomment-340296285

rptb1 commented 6 years ago

Caused by this code https://github.com/keybase/client/blob/6f1e2c08d514713db6dc33081a09a6dcf9bd5493/go/libkb/socket_nix.go#L106-L122

rptb1 commented 6 years ago

So yes, something is too feeble. It's net.Dial https://golang.org/pkg/net/#Dial

rptb1 commented 6 years ago

OK, this has come back to me from about 25 years ago. I believe this is an underlying Unix restriction to do with packing all the various socket address families into a single fixed-length structure. In particular, the AF_UNIX address family structure has a fixed length path for the sun_path field. See man 7 unix. Go's runtime library is just passing on that restriction. I wrote software loooong ago that tried to create Unix sockets in home directories and had exactly the same problem.

You can't do it reliably.

The usual workaround is to create sockets somewhere with a guaranteed short path, such as /tmp/PID/foo.sock though there may be better solutions out there.

The reason we have inconsistent results in this thread (and that making symlinks kinda sorta works) is that my home directory + "Library/Keybase" is just under the limit.

I think you will come across this bug on every Posix system sooner or later.

mmaxim commented 6 years ago

We have code that changes directory to avoid this, I wonder if it is not working in this case for some reason. I'll take a look.

rptb1 commented 6 years ago

I'm not sure about this (this is old memory), but I don't think you can reliably use a relative path, so chdir might not help.

mmaxim commented 6 years ago

So this comment (https://github.com/keybase/keybase-issues/issues/3277#issuecomment-421082632) is a fundamentally different error than what you saw before. The command line program is talking to the service, but those errors indicate the service could not connect to any external service. This suggests to me it is actually working, the warnings should not be fatal.

mmaxim commented 6 years ago

It's also possible that Electron is not doing this chdir thing when it tries to connect to the socket, I think only Go is doing it currently.

rptb1 commented 6 years ago

This makes the menu bar widget recognise me as logged in:

rb@silverbird$ ln -s $HOME /tmp/foo
rb@silverbird$ HOME=/tmp/foo /Applications/Keybase.app/Contents/MacOS/Keybase

It also makes the "main app" get further. screen shot 2018-09-13 at 19 30 54

mmaxim commented 6 years ago

I'm coming around to the idea that the bug we need to fix here is that Electron cannot connect to the socket when it is in one of these long paths. For background, the reason we put the socket there is to support a Finder extension for KBFS, so moving it might not be in the cards, although I will need to double check that.

rptb1 commented 6 years ago

You might not need to move it, but instead access it via a symlink in /tmp. Perhaps the "dial" (or equivalent) function could spot the long path and create a symlink in /tmp in that case. Or always use one (better regression testing that way). For what it's worth, I think sometimes using chdir in untested corner cases is somewhat unreliable coding.

mmaxim commented 6 years ago

Ok we have a ticket internally for this now, will be fixed in the next release (a couple weeks). Thanks for all the great information!

rptb1 commented 6 years ago

It's been quite fun.