microsoft / WSL

Issues found on WSL
https://docs.microsoft.com/windows/wsl

linkat fails with EACCES when the target inode has deleted original hardlink #3972

Closed trchen1033 closed 4 years ago

trchen1033 commented 5 years ago

$ touch foo
$ ln foo bar
$ rm -f foo
$ ln bar ham
ln: failed to create hard link 'ham' => 'bar': Permission denied

https://gist.github.com/trchen1033/9f004be23919f8f7a2ffff4b1da9bf7b

therealkenc commented 5 years ago

Cannot reproduce on either drvfs or lxfs. Sequence speaks for itself though, so if it is failing you are probably onto something. Best guess absent some me2s is you are missing some important CLI steps leading up to the touch foo.

trchen1033 commented 5 years ago

I confirm this does not affect drvfs. My / is mounted as wslfs, though. Here's what my /proc/mounts looks like:

rootfs on / type wslfs (rw,noatime)
none on /dev type tmpfs (rw,noatime,mode=755)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,noatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,noatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,noatime,gid=5,mode=620)
none on /run type tmpfs (rw,nosuid,noexec,noatime,mode=755)
none on /run/lock type tmpfs (rw,nosuid,nodev,noexec,noatime)
none on /run/shm type tmpfs (rw,nosuid,nodev,noatime)
none on /run/user type tmpfs (rw,nosuid,nodev,noexec,noatime,mode=755)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
cgroup on /sys/fs/cgroup type tmpfs (rw,relatime,mode=755)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,relatime,devices)
C:\ on /mnt/c type drvfs (rw,noatime,uid=0,gid=0,case=off)

The above repro was done in my home directory, /home/trchen; all permissions look normal. The same repro can be done as root too. I use a self-installed Gentoo distro, though I don't think it's related to my distro or glibc. I haven't tried, but I can probably make a repro that makes syscalls directly instead.
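The sequence from the original report can be wrapped in a small script that runs it in a throwaway directory and reports the exit status of the final ln. This is only a sketch: the function name repro_link_after_unlink and the scratch-directory naming are made up for illustration, and it assumes mktemp -d accepts a template path. Pass a directory that lives on the filesystem under test (the report used a home directory on wslfs).

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): run touch+ln+rm+ln in a scratch
# directory under $1 and return the exit status of the final ln.
repro_link_after_unlink() {
    parent=${1:-.}
    scratch=$(mktemp -d "$parent/linkrepro.XXXXXX") || return 2
    (
        cd "$scratch" || exit 2
        # Set up an inode whose original name has been deleted.
        { touch foo && ln foo bar && rm -f foo; } || exit 2
        # This is the call reported to fail with "Permission denied".
        ln bar ham
    )
    status=$?
    rm -rf "$scratch"
    return "$status"
}
```

A call such as repro_link_after_unlink "$HOME" returns 0 when the second ln succeeds. A non-zero status, together with ln's own error message, indicates the failure described above; this sketch uses status 2 for setup problems.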

therealkenc commented 5 years ago

> I use a self installed Gentoo distro

[....redacted....] Of course you do.

> I haven't tried, but I can probably make a repro that makes syscalls directly instead.

You can tee that up if you like. But before you take the trouble, vary some external variables -- I can't guess which. Your strace(1) log has a one-liner linkat() with 'bar' and 'ham' -- files which aren't breathed on before the one syscall. A test case with the one call won't elucidate much. A test case with the touch+ln+rm+ln sequence might. Hard to say. But I'd scratch head for an unspoken externality first, starting with "not Gentoo".

Below is my strace(1) log using ln(1) 8.28 and glibc 2.27-3ubuntu1 FWIW. Which is notable mostly for being identical save for the linkat() fail (on a blurry eyed glance anyway). So we're looking for something else. This on 18875 with the usual caveat.

openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=2931584, ...}) = 0
mmap(NULL, 2931584, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0334f34000
close(3)                                = 0
stat("ham", 0x7fffc24f1ae0)             = -1 ENOENT (No such file or directory)
lstat("bar", {st_mode=S_IFREG|0666, st_size=0, ...}) = 0
linkat(AT_FDCWD, "bar", AT_FDCWD, "ham", 0) = 0

therealkenc commented 5 years ago

Quick follow-up. Me (still 18875):

$ mount
rootfs on / type lxfs (rw,noatime)
[...]

That lxfs vs wslfs is different enough for me.

0xbadfca11 commented 5 years ago

I can reproduce this with build 18875's wslfs; it does not happen on lxfs. (wslfs is currently the default filesystem for VolFs on new distribution installs.)
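Since the lxfs versus wslfs difference is what separates working from failing setups in this thread, a quick way to check which one backs a mount point may help. The helper below is a sketch that parses mount(8)-style lines like the ones quoted earlier; the name fs_type_of is hypothetical, and it assumes device names and mount points contain no whitespace.

```sh
#!/bin/sh
# Hypothetical helper: read mount(8)-style lines ("dev on /path type fstype (opts)")
# on stdin and print the filesystem type of the mount point given as $1.
# If several lines match, the last one wins. Exits 1 when nothing matches.
fs_type_of() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp && $4 == "type" { fstype = $5 }
        END { if (fstype != "") print fstype; else exit 1 }
    '
}
```

Used as mount | fs_type_of / it prints the type of the root filesystem, which is the value the reports above compare.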

trchen1033 commented 5 years ago

I confirm. Tried on my other computer with the same distro, except with lxfs. Very likely it's a wslfs bug.

bonki commented 4 years ago

I can also reproduce this with wslfs, here's a complete strace.

$ rm -f foo bar ham
$ strace -o wslfs_ln_fail.strace -f sh -c 'touch foo; ln foo bar; rm -f foo; ln bar ham'
ln: failed to create hard link 'ham' => 'bar': Permission denied

andreasstieger commented 4 years ago

This has been seen to break zypper, which also uses hardlinks, on openSUSE Leap 15.1 on WSL: https://bugzilla.opensuse.org/show_bug.cgi?id=1159195

therealkenc commented 4 years ago

This one appears to be addressed in WSL2 (which sports an ext4 filesystem).

> A test case with the touch+ln+rm+ln sequence might.

Does. From #4816

[image]

lnussel commented 4 years ago

Under which circumstances, and since when, is wslfs used as the default file system? If it's the default for everyone now, we have a serious problem in openSUSE, as the package manager would no longer work with the images we have in the store.

therealkenc commented 4 years ago

I'll ping the devs internally. The S/N ratio is pretty low so ones like this one sometimes get buried.

As a data point, was there any change in the package manager with respect to using (or not using) hardlinks as part of the process (in the last 18 months or so)? On best evidence the lxfs/wslfs difference points fairly conclusively to a regression; but it will be helpful to know if there is more than one variable in play.

lnussel commented 4 years ago

The affected code in libzypp does not look like it has changed in the last two years: https://github.com/openSUSE/libzypp/blame/master/zypp/Fetcher.cc#L541 https://github.com/openSUSE/libzypp/blame/master/zypp/PathInfo.cc#L836

lnussel commented 4 years ago

Short of a quick fix, is there a workaround? Is there a way to downgrade to lxfs, for example?

therealkenc commented 4 years ago

Not that I know of. The / mountpoint isn't under user control.

One could hypothetically work-around in the source, acknowledging that's icky. Or LD_PRELOAD a libzypp.so shim since hardlinkCopy() is public. Typing this blind into the message:

      std::string unlinkpath;
      if ( pi.isExist() )
      {
        // int res = unlink( newpath );
        // Rename instead of unlinking, so no link is deleted before link() runs.
        unlinkpath = newpath.asString() + ".unlink";
        if ( ::rename( newpath.asString().c_str(), unlinkpath.c_str() ) != 0 )
          return logResult( errno ); // ::rename() returns -1; the cause is in errno
      }

      // Here: no symlink, no newpath
      // (the error paths below leave the ".unlink" file behind)
      if ( ::link( oldpath.asString().c_str(), newpath.asString().c_str() ) == -1 )
      {
        switch ( errno )
        {
          case EPERM: // /proc/sys/fs/protected_hardlink in proc(5)
          case EXDEV: // oldpath  and  newpath are not on the same mounted file system
            MIL << " => copy" << endl;
            return copy( oldpath, newpath );
            break;
        }
        return logResult( errno );
      }
      // The new link exists; drop the renamed placeholder.
      if ( !unlinkpath.empty() )
        ::unlink( unlinkpath.c_str() );
      return logResult( 0 );

Which is to say, in the OP analogy, this works:

$ touch foo
$ ln foo bar
$ # rm -f foo
$ mv foo foo.unlink
$ ln bar ham
$ rm -f foo.unlink

Not advocating actually spinning a libzypp, natch; just that it's a plausible work-around.
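For anyone who wants to try the same rename-then-link idea outside libzypp, it can be sketched as a shell function. The name link_replacing and the ".unlink" suffix handling are illustrative assumptions, the function assumes "$2.unlink" is not already in use, and whether this sidesteps the wslfs failure in every case has not been verified here.

```sh
#!/bin/sh
# Hypothetical helper mirroring the rename-then-link idea: make $2 a hard link
# to $1, moving any existing $2 aside instead of deleting it first.
link_replacing() {
    src=$1
    dst=$2
    aside=
    if [ -e "$dst" ] || [ -L "$dst" ]; then
        aside="$dst.unlink"
        mv -f "$dst" "$aside" || return 1
    fi
    if ! ln "$src" "$dst"; then
        # Put the old file back so a failed link does not lose it.
        [ -n "$aside" ] && mv -f "$aside" "$dst"
        return 1
    fi
    # The new link exists; now the moved-aside name can go.
    [ -n "$aside" ] && rm -f "$aside"
    return 0
}
```

Unlike the fragment above, this version restores the original destination when ln fails, which seemed the safer default for a script.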

lnussel commented 4 years ago

Too much for documenting a quick workaround. I was hoping for some simple command to enter on the Windows side to downgrade to lxfs. This is a real bug in wslfs, after all, that also affects other legitimate workloads. #4066, for example, also looks like it.

therealkenc commented 4 years ago

#4066 is more likely #1529; can't tell because there's no repro or strace log over there. Bitbake has surfaced before in #2665. Analogous to everyone's favorite npm #14, except bitbake isn't as popular.

This one was novel because there's no handle open; but educated speculation would be that a handle is open on the win32 side of the 9p service. 9p was released in 18342, circa February 2019; uncoincidentally included a couple months later in 18342 per the OP.

therealkenc commented 4 years ago

Can someone with a working WSL 1 open an elevated ("run as administrator") cmd prompt, launch wsl.exe -d OpenSUSE-Leap-15-1 from it, and then run the OP repro? It probably doesn't need to be SUSE. I have a working theory about what broke, which is probably incorrect, but it is worth a try.

[ed] nvm, I managed to get a WSL 1 live using --export / --import. No dice running elevated.

bonki commented 4 years ago

@therealkenc Is this fix part of 2004 or do we need to wait for the next release?

therealkenc commented 4 years ago

Right, Craig's fixinbound was Feb 12th which means there is no way this made 2004.

therealkenc commented 4 years ago

This one went fixinbound Feb 12th; let's call it amorphous "Stability improvements for virtio-9p (drvfs)" in 19640.

/fixed 19640

ghost commented 4 years ago

This bug or feature request originally submitted has been addressed in whole or in part. Related or ongoing bug or feature gaps should be opened as a new issue submission if one does not already exist.

Thank you!

yecril71pl commented 4 years ago

Is said 19640 for WSL2 only? Have you just discontinued WSL? 😲

therealkenc commented 4 years ago

> Is said 19640 for WSL2 only?

No.

[image]

yecril71pl commented 4 years ago

I am not sure how relevant it is, but I am on 19041 and the problem does not occur after upgrading to WSL 2.