microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl
MIT License

[WSL2] [Interop] Keep a single shared /run/WSL/* socket #5065

Open paulstelian97 opened 4 years ago

paulstelian97 commented 4 years ago

Launch a new console of a WSL2 distro (either in Windows Terminal, wsl.exe directly or anything else). Then launch either tmux or gnome-terminal. Then close the original console. If I return to the tmux session and try to run a Windows executable, or if I attempt this in gnome-terminal, I get the following error:

<3>init: (11770) ERROR: UtilConnectToInteropServer:300: connect failed 2

This is because the $WSL_INTEROP variable points to a socket which was deleted (when the original terminal was closed).

* What's wrong / what should be happening instead: Be able to use Interop even in that situation.
* Strace of the failing command, if applicable: The only mildly relevant line sums up as: connect(fd, {AF_UNIX, sun_path="/run/WSL/11067_interop"}, 110) = -1 ENOENT.

therealkenc commented 4 years ago

Alright, gleaning the repro, you're in a state like this:

image

You can see /init listening with lsof. Here the client's /init parent is 173:

image

As a work-around, you can set the $WSL_INTEROP variable manually.

image

Perhaps the binfmt interpreter instance of /init could walk up the process tree itself, automating the aforementioned. Eliminate $WSL_INTEROP. [Just spitballing.]
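
A rough sketch of what "walking up the process tree" could look like from the shell side, without pstree. It assumes a Linux /proc and that the sockets are named /run/WSL/<pid>_interop, as in the strace line above. find_interop_socket is a made-up helper name, not something WSL ships, and the optional directory argument exists only so the function can be tried against a scratch directory.

```bash
# Hypothetical helper: print the first existing interop socket path found
# while walking from a starting PID up through its ancestors.
# Usage: find_interop_socket [start_pid] [socket_dir]
find_interop_socket() {
    local pid=${1:-$$} dir=${2:-/run/WSL} ppid
    while [[ $pid =~ ^[0-9]+$ ]] && (( pid > 0 )); do
        if [[ -e "$dir/${pid}_interop" ]]; then
            printf '%s\n' "$dir/${pid}_interop"
            return 0
        fi
        # The PPid field of /proc/PID/status names the parent process.
        ppid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status" 2>/dev/null) || return 1
        [[ -n $ppid ]] || return 1
        pid=$ppid
    done
    # Reached PID 0 without finding a socket.
    return 1
}
```

Possible usage: `sock=$(find_interop_socket) && export WSL_INTEROP=$sock`. This picks the nearest ancestor that has a socket; whether that is always the right one is an assumption.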

paulstelian97 commented 4 years ago

Workaround works, of course, but unless there is a way I can somehow automate it, it doesn't matter.

lvlts commented 4 years ago

I have this function in my .zshrc, and I call it on every shell init:

fix_wsl2_interop() {
    for i in $(pstree -np -s $$ | grep -o -E '[0-9]+'); do
        if [[ -e "/run/WSL/${i}_interop" ]]; then
            export WSL_INTEROP=/run/WSL/${i}_interop
        fi
    done
}

So far, so good, it works all the time.

Edit: fixed the regex to include 0 as @zakandrewking pointed out.

zakandrewking commented 4 years ago

thanks @lvlts, this helps me. I'd suggest a small change: grep -o -E '[0-9]+'

paulstelian97 commented 4 years ago

Stole that, added in my own .bashrc. I'll just call it manually if I see issues.

str4d commented 4 years ago

If anyone is arriving here and needs a similar function for nushell (like I did):

alias fix-wsl2-interop --save [] {
  echo $nu.env | insert WSL_INTEROP $(
    pstree -np -s | grep -o -E '[0-9]+' | lines | each {
      build-string /run/WSL/ $it _interop
    } | split column "|" path | insert exists {
      get path | path exists
    } | where $it.exists | get path
  ) | config set_into env
}
elucidsoft commented 3 years ago

This is freaking odd. I had a previous build with WSL 2, the exact same source code, everything the same. I reinstalled Windows and recreated the setup, and encountered this issue without explanation. It's the same setup as before, so what made this happen this time vs. last time? I have no idea, as I literally set both environments up following the exact same procedures.

paulstelian97 commented 3 years ago

@elucidsoft Are you using stuff like Gnome Terminal or other GUI apps? The /run/WSL/* sockets remain valid as long as the wsl.exe or bash.exe (or ubuntu.exe etc) commands that you manually call remain valid. I encounter the issue whenever using gnome-terminal since that one allows the original normal or wt.exe terminal executable to close and invalidate the socket.

elucidsoft commented 3 years ago

I am using VSCode, but what's odd is that this never happened to me once in my previous setup, which should have been identical.

fsackur commented 3 years ago

...and for any powershellers, a similar function for pwsh:

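function Reset-WslInterop
{
    param ($ProcessId = $PID)

    # Use this process's socket if it exists.
    if (Test-Path "/run/WSL/${ProcessId}_interop")
    {
        $env:WSL_INTEROP = "/run/WSL/${ProcessId}_interop"
        return
    }

    # Otherwise try the parent; stop once there is no parent left.
    $parent = (Get-Process -Id $ProcessId).Parent
    if ($null -eq $parent) { return }
    Reset-WslInterop $parent.Id
}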
elucidsoft commented 3 years ago

Yes, I don't use PowerShell, but I did this in a bash script as the solution.

lyf2000 commented 3 years ago

I've encountered the same error with my WSL2 when trying to run docker-compose build to rebuild the containers. The error was:

newschool-db uses an image, skipping
newschool-redis uses an image, skipping
Building newschool-api
<3>init: (671) ERROR: UtilConnectToInteropServer:300: connect failed 2
Traceback (most recent call last):
  File "docker/credentials/store.py", line 80, in _execute
  File "subprocess.py", line 411, in check_output
  File "subprocess.py", line 512, in run
subprocess.CalledProcessError: Command '['/mnt/c/Program Files/Docker/Docker/resources/bin/docker-credential-desktop.exe', 'list']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bin/docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 67, in main
  File "compose/cli/main.py", line 126, in perform_command
  File "compose/cli/main.py", line 302, in build
  File "compose/project.py", line 468, in build
  File "compose/project.py", line 450, in build_service
  File "compose/service.py", line 1125, in build
  File "docker/api/build.py", line 261, in build
  File "docker/api/build.py", line 308, in _set_auth_headers
  File "docker/auth.py", line 302, in get_all_credentials
  File "docker/credentials/store.py", line 71, in list
  File "docker/credentials/store.py", line 93, in _execute
docker.credentials.errors.StoreError: Credentials store docker-credential-desktop.exe exited with "".
[668] Failed to execute script docker-compose

But later I just ran it as root and that resolved the trouble. It's likely a solution only for specific issues, but I hope it will help someone :)

meymeynard commented 3 years ago

@lyf2000 Is there any way this could run without the sudo?

paulstelian97 commented 3 years ago

  1. Ensure your user is in the Docker group
  2. At the beginning of any session, run "sg docker" to launch a shell that has the proper group, since normal logins won't load these supplemental groups from /etc/group, I think.
sfmontyo commented 3 years ago

So I get this issue repeatedly because I like to use gnome-terminal instead of Windows Terminal. Here are the steps to reproduce:

  1. Run Windows Terminal and use your favorite WSL2 distro.
  2. Within the WSL2 distro, start gnome-terminal. (You'll need to have an X server on your Windows host.)
  3. In the new gnome-terminal, execute the command:
    /mnt/c/Windows/System32/cmd.exe /C "echo %USERPROFILE%"

    Note that it works and prints out the contents of the Windows environment variable USERPROFILE.

    Also note that the WSL_INTEROP bash environment variable points to /run/WSL/NNNN_interop or something like that.

  4. Close the original Windows Terminal. This frees up the WSL instance.
  5. Go back to the gnome-terminal and re-execute the above command. It will fail with the error:
    ERROR: UtilConnectToInteropServer:300: connect failed 2
  6. Now start up a new Windows Terminal with a new WSL2 instance. In that terminal, see what the contents of WSL_INTEROP are:
    echo $WSL_INTEROP
  7. Go back to the gnome-terminal and change the WSL_INTEROP variable to the new value from the new Windows Terminal.
  8. Re-execute the command and it works again, since it's now using the new filesystem socket.
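
To tell which state a given terminal is in between steps 5 and 8, something along these lines might help. It is only a sketch: interop_status is a hypothetical helper name, and it merely checks whether the path still exists, which is the condition the ENOENT in the original report points at.

```bash
# Hypothetical helper: classify an interop socket path.
# Prints "unset", "stale" (the path no longer exists) or "ok".
# With no argument it inspects the current $WSL_INTEROP.
interop_status() {
    local path=${1-${WSL_INTEROP-}}
    if [[ -z $path ]]; then
        echo unset
    elif [[ -e $path ]]; then
        echo ok
    else
        echo stale
    fi
}
```

Running `interop_status` in the gnome-terminal after step 4 would be expected to report a stale path, assuming the socket is deleted when the original terminal closes as described earlier in the thread.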
paulstelian97 commented 3 years ago

The fact that the socket is per-tty rather than, say, per-Windows session or something, is annoying. I would have preferred to always run Gnome Terminal and launch new terminals from that one.

JeppeKlitgaard commented 3 years ago

I'll leave my version for fish, shamelessly translated from lvlts' bash solution above.

# .config/fish/conf.d/wsl.fish

function fix_wsl2_interop
    for i in (pstree -np -s $fish_pid | grep -o -E '[0-9]+')
        if test -e "/run/WSL/"$i"_interop"
            set -x WSL_INTEROP "/run/WSL/"$i"_interop"
        end
    end
end

fix_wsl2_interop
plato79 commented 3 years ago

OK, the solutions listed here don't work for me. I'm using Ubuntu 20.04, and when I check with pstree -p I can't see any values resembling the number in the "_interop" file. I don't know the cause, but I think we should find another way to find this number.

paulstelian97 commented 3 years ago

You should see if anything at all exists in /run/WSL/. If not, then it's because there's simply no open native terminal.
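
One way to check that from a script, as a sketch: list_interop_sockets is a hypothetical name, and the *_interop pattern is assumed from the socket names seen elsewhere in this thread.

```bash
# Hypothetical helper: print every existing *_interop entry in a directory,
# one per line. Prints nothing when the directory is empty or missing.
# Usage: list_interop_sockets [socket_dir]
list_interop_sockets() {
    local dir=${1:-/run/WSL} f
    for f in "$dir"/*_interop; do
        # An unmatched glob stays literal, so skip anything that does not exist.
        [[ -e $f ]] && printf '%s\n' "$f"
    done
    return 0
}
```

If `list_interop_sockets` prints nothing, there is no socket to point $WSL_INTEROP at.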

joedborg commented 3 years ago

I've done it by adding this to my .zshenv:

export WSL_INTEROP="/run/WSL/$(ls -tr /run/WSL | head -n1)"

No idea how robust this is going to be, but the oldest socket seems to be the one for me.

plato79 commented 3 years ago

I think this would work.. Because if you think about it, whenever you open a new instance a new file is generated and it will be the file we need to assign to the environment variable. Thanks.

Edit: Just realized, I think you shouldn't reverse order. Because ls -t lists files from newest first. So reversing returns oldest first.

joedborg commented 3 years ago

On my machine, the oldest one is the one that always seems to work. It's all the new ones that are created, per shell, that don't. Perhaps we have different issues that manifest the same?

plato79 commented 3 years ago

Well, using the oldest also works. Although I'm using this mostly for the code . command. If you use it for anything else, it could matter.

joedborg commented 3 years ago

Yeah, as I mentioned, no idea how robust this will be; it's just a workaround until we can get a real fix in WSL (fingers crossed).

paulstelian97 commented 3 years ago

I think any of the sockets work, so you just need a heuristic to select the one that will take the longest to be closed. Could be oldest, could be newest, could be any of the others. When you close one of the native terminals (wt, bash.exe, ubuntu.exe etc) the corresponding socket will be both closed and deleted from that path.
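
For anyone who wants to try both heuristics, here is a sketch that picks the oldest or newest entry by modification time without parsing ls output. pick_interop_socket is a hypothetical name, and whether mtime is a good proxy for "will stay open the longest" is exactly the open question.

```bash
# Hypothetical helper: print the oldest or newest *_interop entry in a
# directory, judged by modification time. Returns non-zero if none exist.
# Usage: pick_interop_socket [oldest|newest] [socket_dir]
pick_interop_socket() {
    local mode=${1:-oldest} dir=${2:-/run/WSL} f best=
    for f in "$dir"/*_interop; do
        [[ -e $f ]] || continue
        if [[ -z $best ]]; then
            best=$f
        elif [[ $mode == newest && $f -nt $best ]]; then
            best=$f
        elif [[ $mode == oldest && $f -ot $best ]]; then
            best=$f
        fi
    done
    [[ -n $best ]] && printf '%s\n' "$best"
}
```

Possible usage: `sock=$(pick_interop_socket oldest) && export WSL_INTEROP=$sock`.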

joedborg commented 3 years ago

In my specific case, only the lowest numbered works.

neerolyte commented 3 years ago

I wanted a solution that would work automatically in existing shells, not just new ones (or by manually executing a fix). Others may want this solution too.

Running the function from above within a PROMPT_COMMAND means WSL_INTEROP will be reset every time the prompt is updated (after each command invocation).

I've also optimised the function a little to short circuit the run if possible and avoid the extra process call (grep).

prompt_fix_wsl() {
    # return early if WSL_INTEROP is unset or its socket still exists
    [[ -n "$WSL_INTEROP" ]] || return
    ! [[ -e "$WSL_INTEROP" ]] || return
    local pid pids
    # split pstree output on "-", "(" and ")"; local keeps the caller's IFS intact
    local IFS='-()'
    # shellcheck disable=SC2207
    pids=($(pstree --numeric-sort --show-pids --show-parents $$))
    for pid in "${pids[@]}"; do
        # keep only purely numeric fields (the PIDs)
        [[ "$pid" =~ ^[0-9]+$ ]] || continue
        [[ -e "/run/WSL/${pid}_interop" ]] || continue
        export "WSL_INTEROP=/run/WSL/${pid}_interop"
        # stop looking for sockets
        return
    done
}

Setting it up like so:

function my_prompt_command {
    # .... my existing prompt command that's probably too long to justify the above optimisation ...
    prompt_fix_wsl
}

PROMPT_COMMAND=my_prompt_command

WSL_INTEROP is now restored every time a command is executed in bash.

marwatk commented 3 years ago

All of these workarounds assume you don't need to keep the original interop socket open (because it's in use). Something like wsl-vpnkit, which allows WSL2 networking under a VPN, launches a persistent Windows process, but that process gets terminated when the shell that started it is closed.

While it's possible to detect and remediate the closure it would be much better if there was just a single socket instead of one per-tty.

sleeperss commented 3 years ago

None of these solutions work for https://github.com/shayne/wsl2-hacks, as the terminal PID isn't accessible anymore.

I came up with this solution:

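#!/usr/bin/bash
# Note: the export only reaches the calling shell if this file is sourced.

export WSL_INTEROP=
for socket in /run/WSL/*; do
   # skip the literal pattern when /run/WSL is empty
   [[ -e $socket ]] || continue
   # keep sockets that ss reports as listening, remove the stale ones
   if ss -elx | grep -qF "$socket"; then
      export WSL_INTEROP=$socket
   else
      rm -- "$socket"
   fi
done

if [[ -z $WSL_INTEROP ]]; then
   echo -e "\033[31mNo working WSL_INTEROP socket found !\033[0m"
fi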
ElMehdi-TouimiBenjelloun commented 2 years ago

Thanks, it works like a charm.

techtheriac commented 1 year ago

export WSL_INTEROP="/run/WSL/$(ls -tr /run/WSL | head -n1)"

This works for me.

SuperSandro2000 commented 1 year ago

that breaks on wsl2

WingTillDie commented 1 year ago

Fix by searching for the lowest ancestor of tmux: client that is init, then using the PID of that init as the WSL interop ID:

#!/usr/bin/bash
fix_wsl_interop_tmux() {
    # Print the pstree line(s) that contain the tmux client.
    ancestors_of_tmux_client() {
        pstree --show-pids | grep -F 'tmux: client'; }
    # Take the third-from-last PID, expected to be the lowest init ancestor.
    get_pid_of_lowest_ancestor_init() {
        select_numbers() {
            grep -oP '\d+'; }
        select_line_-3(){
            tail -n3 | head -n1; }
        select_numbers | select_line_-3
    }
    REAL_WSL_INTEROP_ID=$(ancestors_of_tmux_client | get_pid_of_lowest_ancestor_init)
    # Make the stale $WSL_INTEROP path a symlink to the live socket.
    sudo ln -s "/run/WSL/${REAL_WSL_INTEROP_ID}_interop" "$WSL_INTEROP"
}
fix_wsl_interop_tmux