moby / moby

The Moby Project - a collaborative project for the container ecosystem to assemble container-based systems
https://mobyproject.org/
Apache License 2.0
68.46k stars 18.63k forks source link

rename / move file fails on mounted volume in Windows Server 2019 #38256

Open rr-rosendahl opened 5 years ago

rr-rosendahl commented 5 years ago

Description

Renaming or moving files within mounted volumes fails with a "path not found" error. This happens when renaming the file on the command line: ren README.md README.md.bak As well as with powershell Move-Item (see the attached screenshot): Move-Item .\README.md .\README.md.bak Moving files in compiled code with the MoveFile function of the Windows API also results in an error (microsoft error code 3 - file not found).

So it looks like renaming/moving files seems to fail at the operating system level. Copying and deleting files works without problems.

Steps to reproduce the issue:

  1. docker run --name Test -it -v C:\Users\User\Documents\volume:c:/vs microsoft/windowsservercore cmd.exe
  2. docker attach Test
  3. ren c:\vs\MariaDB\README.md c:\vs\MariaDB\README.md.bak

Describe the results you received:

Moving / renaming the file resulted in an error message: Could not find a part of the path.

2018-11-22_screenshot

Describe the results you expected:

Successful renamed / moved file.

Additional information you deem important (e.g. issue happens only occasionally):

We stumbled upon this issue when we were trying to start a MariaDB database instance (with InnoDB engine) on the mounted volume. Under certain circumstances, the database tries to move files on startup and shutdown and then stops with an error message. We checked the MariaDB source code and noticed that they were using the Microsoft Windows API MoveFile function. We changed that to do a CopyFile (and DeleteFile) instead and that ran without error. That led us to check the rename commands on the shell and noticed that those also failed.

Output of docker version:

Client:
 Version:           18.03.1-ee-4
 API version:       1.37
 Go version:        go1.10.2
 Git commit:        0ded23c
 Built:             Thu Oct 25 00:41:52 2018
 OS/Arch:           windows/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.03.1-ee-4
  API version:      1.37 (minimum version 1.24)
  Go version:       go1.10.2
  Git commit:       0ded23c
  Built:            Thu Oct 25 00:56:17 2018
  OS/Arch:          windows/amd64
  Experimental:     false

Output of docker info:

Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 18.03.1-ee-4
Storage Driver: windowsfilter
 Windows:
Logging Driver: json-file
Plugins:
 Volume: local
 Network: ics l2bridge l2tunnel nat null overlay transparent
 Log: awslogs etwlogs fluentd gelf json-file local logentries splunk syslog
Swarm: inactive
Default Isolation: process
Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
Operating System: Windows Server 2019 Standard Version 1809 (OS Build 17763.134)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 3.978GiB
Name: DG-DOCKER2019
ID: WBWU:XTLJ:KHYI:OWMH:7LLI:ORWM:XAUU:YW2W:5LSP:5ZYH:VC7E:GQRW
Docker Root Dir: C:\ProgramData\docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Windows Server 2019 Standard / Windows Docker Container Processisolation (without Hyper-V)

thaJeztah commented 5 years ago

ping @jhowardmsft

rr-rosendahl commented 5 years ago

I edited the description to include a note that copying and deleting files works without problems (only rename/move fails). I also added some background that we were facing that issue when running a database instance on the mounted volume.

rr-rosendahl commented 5 years ago

By the way - the same issue exists with Windows Server 2016 Standard / Windows Docker Container Processisolation (without Hyper-V).

drnybble commented 5 years ago

Maybe related? https://github.com/moby/moby/issues/30512

olljanat commented 5 years ago

@exodusd sounds like like one more Windows containers weakness like there is many others.

Can you share some real world use case where this matters?

rr-rosendahl commented 5 years ago

Yes, i described the real world scenario in the "Additional Information" section. We're mounting a MariaDB database folder into the container. There are circumstances where MariaDB will move/rename InnoDB files on startup, which fails because of the error, resulting in the database engine not starting up. That's actually how we found the problem - we just described the simpler test case with manually renaming a file to make it easier to reproduce. ;-)

olljanat commented 5 years ago

@exodusd I highly recommended you to use official MariaDB image top of Linux version of Docker and only use Windows containers for workloads which are not able run on Linux.

How ever if you have good reason to use Windows containers I highly recommended that you open ticket to Microsoft Premier Support about this because afaik it is caused by weakness/bug on Windows it selves instead of Docker.

rr-rosendahl commented 5 years ago

Well, we're actually fixed on windows for that one. But we've done some research and found that if the change the container isolation level from "process" to "hyperv", then the error doesn't occur. It's not perfect, but at least it's not broken. ;-)

lowenna commented 5 years ago

@exodusd

I just tried this out use docker binaries based on moby/moby master (can download from master.dockerproject.org, but I built locally) and was unable to repro. Using either Hyper-V or Process isolation.

Running on RS5/Server 2019/Build 17763. (I was on patch release 253, but noticed you were on a slightly older 134, so there may have been some servicing fixes you don't have yet)

I downloaded mariadb-10.3.12.winx64.zip from https://downloads.mariadb.org/mariadb/10.3.12/ and expanded it to c:\mariadb on the host.

docker run -it --rm [--isolation=hyperv] -v c:\mariadb:c:\mariadb microsoft/windowsservercore cmd

(In container) cd \mariadb ren README.md README.md.bak (and then back again)

It's possible what you're seeing might be a combination of a docker bug which has since been fixed, or a servicing fix in Server 2019 which wasn't present in patch release 134.

Could you try both updating your machine to the latest servicing patches, and try the binaries from moby master to verify if this still seems to be an issue.

Thanks

EDIT: I also tried this on RS1/Windows Server 2016 and was similarly unable to repro.

rr-rosendahl commented 5 years ago

@jhowardmsft Thanks a lot for trying to reproduce the issue. We just tried again by just updating Microsoft Windows Server 2019 to the latest version and we also couldn't reproduce the issue. It seems that Microsoft has fixed it in one of the recent updates!

Thanks again for taking the time to check! We probably wouldn't have tried process isolation again, but now we can use that and avoid the overhead of hyperv isolation. :-)

rr-rosendahl commented 5 years ago

Hi! Unfortunately the problem is back again... We're now running Windows Server Core 2019 LTS (version 10.0.17763.379) in process isolation mode. When renaming a file in a subfolder of a mounted directory, the system tells us that the path could not be found (see the attached screenshot).

rename-bug

rr-rosendahl commented 5 years ago

We were finally able to narrow down the problem. It only seems to occur when File System Resource Management (dirquota) has been activated. Once a quota is activated, rename/move in subdirectories of mounted directories fails with "The system cannot find the path specified." The problem persists even after removing the quota and restarting the container, the File System Resource Management service or the docker daemon. Only after a reboot of the whole operating system, the problem doesn't occur. Once a quota is added, the renames/moves fail again.

This is the setup that we used to reproduce the issue:

We took video of the steps to reproduce the issue: https://www.youtube.com/watch?v=aY2vnzC76_M

@jhowardmsft could you try to reproduce the issue in process isolation with activated File System Resource Management and a dir quota?

JeffCyr commented 5 years ago

cc @exodusd, @jhowardmsft

I can reproduce the issue on

I had the issue originally with Docker CE 19.03.1. I uninstalled and cleaned Docker then installed Docker EE manually hoping this would resolve the issue, but it didn't.

I don't have File System Resource Management or Quotas configured on my machine.

Here is a 100% repro step:

Step 1

Start a new container

docker run -it --isolation process -v C:\data mcr.microsoft.com/windows/servercore:1903 powershell

Then run these statements inside the container

$ErrorActionPreference = "Stop"
cd C:\data
New-Item test.txt
while ($true) {
    Write-Host (Get-Date -DisplayHint Time)
    Move-Item test.txt test2.txt
    Start-Sleep -Seconds 1
    Move-Item test2.txt test.txt
    Start-Sleep -Seconds 1
}

Step 2

In another prompt, start a new container and exit

docker run --rm -it mcr.microsoft.com/windows/servercore:1903 powershell
PS C:\> exit

Result

As soon as the second container exits, Move-Item will fail on the first container with this error:

Move-Item : Could not find a part of the path.
At line:3 char:5
+     Move-Item test.txt test2.txt
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : WriteError: (C:\data\test.txt:FileInfo) [Move-Item], DirectoryNotFoundException
    + FullyQualifiedErrorId : MoveFileInfoItemIOError,Microsoft.PowerShell.Commands.MoveItemCommand

I hit this issue in my service where we use File.Replace to update a dynamic config file stored in a volume.

I can't reproduce the issue with hyperv isolation


Edit

I wasn't able to reproduce on other machines. I might end up reimaging my workstation to fix this issue. I can provide additional logs if needed

thaJeztah commented 5 years ago

/cc @ddebroy @jterry75

JeffCyr commented 4 years ago

I ended up re-imaging my workstation with a fresh install of Windows 10 1903, it solved the issue for a while but now it is back with the same repro steps as my post above.

I wonder if its related with the size of the windowsfilter folder.

I will gladly provide more logs if required.

docker version

Client: Docker Engine - Enterprise
 Version:           19.03.2
 API version:       1.40
 Go version:        go1.12.8
 Git commit:        c92ab06ed9
 Built:             09/03/2019 16:38:11
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.2
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.8
  Git commit:       c92ab06ed9
  Built:            09/03/2019 16:35:47
  OS/Arch:          windows/amd64
  Experimental:     false

docker info

 Debug Mode: false

Server:
 Containers: 3
  Running: 3
  Paused: 0
  Stopped: 0
 Images: 29
 Server Version: 19.03.2
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics l2bridge l2tunnel nat null overlay transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 18362 (18362.1.amd64fre.19h1_release.190318-1202)
 Operating System: Windows 10 Enterprise Version 1903 (OS Build 18362.356)
 OSType: windows
 Architecture: x86_64
 CPUs: 8
 Total Memory: 31.89GiB
 Name: *******
 ID: **************
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
marcanpilami commented 4 years ago

This problem also seems to happen on Windows 10 1909 (version 18362.752, i.e. latest at the time of writing), either with Docker EE 19.03.5 or Docker for Desktop 19.03.8. As well as on server 1909 with EE 19.03.5.

My repro is even simpler, also with two successive containers:

docker run -it --rm --user ContainerAdministrator --isolation process -v test:c:\TEMP mcr.microsoft.com/powershell:nanoserver-1909 pwsh -Command 'mkdir /TEMP/SUBDIR; echo a > /TEMP/SUBDIR/a'
docker run -it --rm --user ContainerAdministrator --isolation process -v test:c:\TEMP mcr.microsoft.com/powershell:nanoserver-1909 pwsh -Command 'mv /TEMP/SUBDIR/a /TEMP/SUBDIR/b'
> Move-Item: Could not find a part of the path.

Please note that reproduction only works in a subdirectory of the volume - on the mount point (i.e. C:\TEMP in my case) everything works.

Creating a new container allows to always reproduce the issue, however it also happens without this in the initial container after some activity (hard to pinpoint exactly: I've also found the issue after a database failed to rotate its logs, inside a brand new container and volume).

As for the other posts on this thread, it only happens in process isolation mode, but switching to Hyper-V mode is not a possibility in our case. Also, no issues with removing the files, so this really seems to be the same issue.

No FSRM or other quotas on my machines.

Finally, when running a procmon trace... everything works dandily! This is quite a bummer, and a likely indication that there is a timing issue somewhere, as this is a common cause of "the debugger fixed it".

@thaJeztah is this issue still being worked on? I would also be glad to provide any required information if needed.

thaJeztah commented 4 years ago

@marcanpilami I don't think someone is working on this currently.

/cc @kevpar @taylorb-microsoft

JeffCyr commented 4 years ago

I am also still reproducing this issue on multiple machines. It could be related to the size of windowsfilter. I don`t have the issue on a freshly installed computer, than I started having the issue again after pulling/building large containers

rcwoolley commented 4 years ago

I happened to be listening to this episode of the Kubernetes Podcast which talks about Windows Containers: https://kubernetespodcast.com/episode/070-windows-server-containers/

They discussed the problem with GUID for user accounts being unique between instances of Windows. Is it possible that the ACLs for when the volume was created don't match the ACLs of the new container instance?

Does anyone know how to check this information from the CLI?

marcanpilami commented 4 years ago

@rcwoolley : as far as I know, ID stability is only an issue when creating new users inside an image or container. Windows has the concept of "well known user ids", and there are two users in this category that were created for docker: ContainerUser & ContainerAdministrator. My reproduction uses ContainerAdministrator (which in turn is a member of the well known group Administators).

I do not think this could be related - after all, we can delete the files for exemple, and the issue also appears by itself after a while in the same container. But thanks for trying to make this issue progress!

s-jeckel-mxn commented 4 years ago

We are now having this issue on one of our machines as well. It used to work until a few days ago.

Docker 2.2.3.0 Windows 10 1909 18363.720

Some observations:

If I can provide you with any information to narrow this down, please let me know.

hongyu69 commented 3 years ago

We ran into exactly the same file move failure on mounted volume when using Docker EE 19.03.11 on Windows server 2019 1809 17763.1397, which is blocking our log rolling inside containers.

@thaJeztah do you know who is the right owner of this issue? Thanks

thaJeztah commented 3 years ago

@hongyu69 from the discussion above, looks like it may be a platform issue (something in Windows itself); perhaps @kevpar or @taylorb-microsoft knows more.

mistoll commented 1 year ago

Same Problem here.

System:

Repro: On host:

> docker run -it --isolation=process -v C:\data:C:\test mcr.microsoft.com/windows/servercore:ltsc2022 cmd

On container:

C:\test>mkdir foo

C:\test>mkdir bar

C:\test>move foo bar
The system cannot find the path specified.

C:\test>dir
 Volume in drive C has no label.
 Volume Serial Number is xxxxxxx

 Directory of C:\test

01/19/2023  10:11 AM    <DIR>          .
01/19/2023  10:11 AM    <DIR>          bar
01/19/2023  10:11 AM    <DIR>          foo
               0 File(s)              0 bytes
               3 Dir(s)  783,761,960,960 bytes free

C:\test>

When using --isolation=hyperv the problem does not occur. When using a volume on drive D: the problem does not occur. No other containers are running. Used a fresh pull of the image.

davhdavh commented 1 year ago

Is there a workaround for this issue?

thaJeztah commented 1 year ago

Don't think so; this looks like an issue in Windows itself, and not the code in this repository. From issues linked to this ticket (in Microsoft's issue tracker), it look like they had problems reproducing the issue in the past; e.g. see this ticket (but there's other tickets linked to that one as well in their issue tracker);

mistoll commented 1 year ago

Is there a workaround for this issue?

Well, kind of. We moved the host directory to a separate partition. docker run -v C:\data:D:\test ... should work.