golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.4k stars 17.59k forks source link

runtime: exit status 0xC0000005 from test process on windows-amd64-longtest builder #38440

Closed bcmills closed 4 years ago

bcmills commented 4 years ago

Observed in 2020-04-10T16:24:46-ea7126f/windows-amd64-longtest:

windows-amd64-longtest …
…
--- FAIL: TestTestCache (4.15s)
    go_test.go:2405: 

        INITIAL

    go_test.go:2413: running testgo [test -x -v -short t/...]
    go_test.go:2413: standard output:
    go_test.go:2413: exit status 3221225477
        FAIL    t/t1    0.486s
        exit status 3221225477
        FAIL    t/t2    0.481s
        === RUN   Test3
            Test3: t3_test.go:6: 1
        --- PASS: Test3 (0.00s)
        PASS
        ok      t/t3    0.462s
        === RUN   Test4
            Test4: t4_test.go:6: 1
        --- PASS: Test4 (0.00s)
        PASS
        ok      t/t4    0.467s
        FAIL

    go_test.go:2413: standard error:

The FAIL in t/t1 is unexpected in this test.

Exit status 3221225477 is 0xC0000005, which some cursory searching seems to suggest is a generic “access violation” error. That suggests possible memory corruption.

I haven't seen any repeats of this error so far.

CC @alexbrainman @zx2c4

aclements commented 4 years ago

CL 235639 on top of near-current master (6851a55f28ae88f4c9809b241d737fe65a977dd3) ran 50 times without reproducing the issue, so that CL alone seems like a sufficient fix.

aclements commented 4 years ago

I spent all night fighting with Windows trying to get an ounce of useful diagnostics out to understand the underlying issue and I'm ready to throw in the towel. I'm trying to get a user crash dump captured so I can windbg it, but I'm not sure if that's even a reasonable thing to do and I'm just not getting any files in my crash dump directory. Could a Windows expert tell me what I'm doing wrong?

I set up a windows-amd64-longtest builder according to https://docs.microsoft.com/en-us/windows/win32/wer/collecting-user-mode-dumps:

mkdir c:\workdir\CrashDumps
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /reg:64
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpFolder /t REG_EXPAND_SZ /d c:\workdir\CrashDumps /reg:64
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpCount /t REG_DWORD /d 10 /reg:64
reg add "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /v DumpType /t REG_DWORD /d 1 /reg:64
reg query "HKLM\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps" /reg:64

And run gomote run -e GO_TEST_SHORT=0 -e GO_TEST_TIMEOUT_SCALE=5 $VM go/src/all.bat but c:\workdir\CrashDumps is just empty after I reproduce the issue.

alexbrainman commented 4 years ago

I don't use windbg or crash dumps, so don't assume I am an expert. @zx2c4 is your man.

I'm trying to get a user crash dump captured so I can windbg it, but I'm not sure if that's even a reasonable thing to do ...

From what I understand, you will be able to see process state at the time of the crash. That assumes that the program exists due to Windows exception. And you will need the executable file itself too to use with your crash dump.

I am not sure what that buys you. You won't be able to use any Go symbols - windbg does not understand DWARF. You will see all Windows DLL functions.

... and I'm just not getting any files in my crash dump directory.

Assuming our program raises Windows exception, you need to adjust Go runtime to let that exception out of Go process. Currently Go does not let exceptions out of the process. Go prints stack trace, if it cannot handle exception.

There is #20498 to adjust Go to allow for exceptions to be handled externally.

https://go-review.googlesource.com/c/go/+/195577/ is latest attempt to solve that problem.

Once you adjust runtime, then you will need to configure your system to handle exception as crash dump. I don't know how to do it, so your guess is as good as mine. Perhaps your instructions above are fine.

Hopefully it is any help.

Alex

zx2c4 commented 4 years ago

ZwFlushVirtualMemory/MmFlushVirtualMemory uses FsRtlAcquireFileForCcFlushEx, which takes a reference to underlying filesystem objects and pokes the normal rootkit^Wfiltering hooks, and then calls MiFlushSectionInternal, which walks the page tables, looks at the dirty flags, and calls MiWaitForPageWriteCompletion.

But so far as I can tell, ZwUnmapViewOfSectionEx calls MiUnmapVad, which then doesn't appear to call into any of the same infrastructure. It's possible that unmapping the pages somewhere deeper down results in a flush, but I'm not seeing it. And judging by @aclements' comment, it seems like empirically there's a difference between calling unmap and calling flush+unmap.

This is counterintuitive for me, and I assume for the rest of you too, because on Linux, munmap usually propagates the dirty bits into the pagecache, so that kswapd can get around to flushing it later, and so that there's some consistency if the file is mapped subsequently. That doesn't mean that munmap will wait for anything, but it does mean that Linux won't forget about those modifications and that they'll eventually be written to disk.

Option 1: It's possible that on Nt, unmapping a file without flushing won't actually propagate the dirty bits anywhere, and so the file essentially doesn't get completely written. That'd be weird, but who knows. Option 2: Another possibility would be that failing to flush doesn't make those file pages incoherent, but it does make calls to traditional file I/O return inconsistent values, and some other debug info parser in the toolchain is using traditional I/O to read things and crashing; this seems unlikely though, as if it were a parser error, you'd expect Go to just go out of bounds. Option 3: Like option 2, but is it possible that the file is copied after the linker is done with it, and it's that copy that is executed? In that case, the bug and evident fix would be easily explainable.

Of these, (1) seems possible, precisely for the reasons that (2) is at least something to consider: apparently normal I/O paths aren't using the same page cache as MapViewOfFile; MSDN says "A mapped view of a file is not guaranteed to be coherent with a file that is being accessed by the ReadFile." That makes me think cache coherence needs to be done explicitly, by way of calls to FlushViewOfFile. Unless things actually aren't that complicated with regard to dirty pages, and we're just running up against the edge case MSDN warns about in the case of (3).

Now the above is pretty speculative, based on some pretty brief IDAing around ntoskrnl.exe. If you demand more assurance, I guess I could actually spend some time learning about how the NT page cache and file I/O work. But it sounds like CL 235639 is working, so I'll +2 that shortly.

zx2c4 commented 4 years ago

I don't use windbg or crash dumps, so don't assume I am an expert. @zx2c4 is your man.

Similar disclaimer. But the bootleg way of doing this would be to just add a breakpoint to ExRaiseAccessViolation and to the invocation of KiExceptionDispatch from KiPageFault, and then you'll be able to inspect the entire system at that moment.

alexbrainman commented 4 years ago

Another thing that caught my attention is that when memory mapped file are used, it is possible for Windows exception to be raised during simple memory access.

https://docs.microsoft.com/en-us/windows/win32/memory/reading-and-writing-from-a-file-view

If something like that happens, Go exception handler would be called. But I expect to see Go stack trace.

Alex

zx2c4 commented 4 years ago

https://docs.microsoft.com/en-us/windows/win32/memory/reading-and-writing-from-a-file-view

That article is about 0xC0000006, but we're dealing with 0xC0000005.

zx2c4 commented 4 years ago

Okay I figured it out, I think. It's likely option 3, no fancy business. About time I read the actual Go source where this issue is happening.

src/cmd/link/internal/ld/main.go has this sequence:

        bench.Start("Munmap")
        ctxt.Out.Close() // Close handles Munmapping if necessary.

        bench.Start("undef")
        ctxt.undef()
        bench.Start("hostlink")
        ctxt.hostlink()

ctxt.Out.Close() calls UnmapViewOfFile, without flushing. Then ctxt.hostlink()is called, which makes a copy by way of calling hostobjCopy() and appending the results as arguments to the next stage:

        argv = append(argv, hostobjCopy()...)

Diving into that function, we see:

                        f, err := os.Open(h.file)
                        if err != nil {
                                Exitf("cannot reopen %s: %v", h.pn, err)
                        }
                        defer f.Close()
                        if _, err := f.Seek(h.off, 0); err != nil {
                                Exitf("cannot seek %s: %v", h.pn, err)
                        }

                        w, err := os.Create(dst)
                        if err != nil {
                                Exitf("cannot create %s: %v", dst, err)
                        }
                        if _, err := io.CopyN(w, f, h.length); err != nil {
                                Exitf("cannot write %s: %v", dst, err)
                        }
                        if err := w.Close(); err != nil {
                                Exitf("cannot close %s: %v", dst, err)
                        }

This is normal boring file I/O, which means it's not synchronized with the mapped file, unless flush was called prior to unmapping. MSDN makes that property abundantly clear. So, not all of the data written to the mapping is included in the copy. Kaboom.

CL 235639 is the correct fix.

aclements commented 4 years ago

Thank you for taking a close look at this @zx2c4 !

Diving into that function, we see:

The one thing that's still bothering me is, if this is where the crash is, why don't we see a stack trace?

The fact that we don't see a stack trace is why I was trying so hard to get a user-mode crash dump. Also, Go's exception handling mechanism never exits with the exception code; it would usually exit with status 2, or some small integer if things go really wrong. This suggests Go isn't handling the exception at all, and that it's in some Windows stack. So I was hoping a crash dump would at least tell me what process it was in, and maybe what Windows stack it was in.

This is normal boring file I/O, which means it's not synchronized with the mapped file, unless flush was called prior to unmapping. MSDN makes that property abundantly clear.

Is this true even when the file has been unmapped and reopened with a different handle? In UNIX, if you're going to interleave mmap'ed IO and read/write IO, you're advised to msync between them, but once you've closed, unmapped, and reopened the file, it's safe to do read/write IO. (msync is also a giant hammer that you don't want to overuse!)

cherrymui commented 4 years ago

I don't think hostobjCopy does anything with the output file we wrote using mmap. hostobjCopy copies "host object files", which are non-Go object files we are linking in.

cherrymui commented 4 years ago

Just a guess (completely unsure): the linker has written the output executable, and another process tries to execute it but failed (to even start running code), so we don't get any stack traces. (It might be similar to on UNIX if we somehow screw the ELF header or the dynamic loader fails.)

zx2c4 commented 4 years ago

I don't think hostobjCopy does anything with the output file we wrote using mmap. hostobjCopy copies "host object files", which are non-Go object files we are linking in.

You're right that hostobjCopy was barking up the wrong tree. So I went to see how go build does things and found the -x flag, and saw that it seemed, on Linux at least, to mv /path/to/temporary/exe /path/to/final/exe:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
mv $WORK/b001/exe/a.out a

That would seem fine according to what we know of the interaction between Windows IO and the page mapper so far. But then I opened up the function that emitted that, and look what I found:

// moveOrCopyFile is like 'mv src dst' or 'cp src dst'.
func (b *Builder) moveOrCopyFile(dst, src string, perm os.FileMode, force bool) error {
        if cfg.BuildN {
                b.Showcmd("", "mv %s %s", src, dst)
                return nil
        }

        // If we can update the mode and rename to the dst, do it.
        // Otherwise fall back to standard copy.

        // If the source is in the build cache, we need to copy it.
        if strings.HasPrefix(src, cache.DefaultDir()) {
                return b.copyFile(dst, src, perm, force)
        }

        // On Windows, always copy the file, so that we respect the NTFS
        // permissions of the parent folder. https://golang.org/issue/22343.
        // What matters here is not cfg.Goos (the system we are building
        // for) but runtime.GOOS (the system we are building on).
        if runtime.GOOS == "windows" {
                return b.copyFile(dst, src, perm, force)
        }

If we look at b.copyFile, we see another boring file I/O copy operation:

        df, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        if err != nil && base.ToolIsWindows {
                // Windows does not allow deletion of a binary file
                // while it is executing. Try to move it out of the way.
                // If the move fails, which is likely, we'll try again the
                // next time we do an install of this binary.
                if err := os.Rename(dst, dst+"~"); err == nil {
                        os.Remove(dst + "~")
                }
                df, err = os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, perm)
        }
        if err != nil {
                return fmt.Errorf("copying %s: %w", src, err) // err should already refer to dst
        }

        _, err = io.Copy(df, sf)
        df.Close()

So again, voila. We have the pattern of: write using MapViewOfFile, forget to flush it, and then read that file using normal boring file I/O. MSDN makes it very clear that if you forget to flush, the contents you made with mapping might be out of sync with that of normal file I/O. That means the final executable that gets executed has some random parts of the file as zeros instead of as the actual file contents. This is, again, option 3 above.

Actually though, I deleted a step from that -x trace. Here's the real one:

/usr/lib/go/pkg/tool/linux_amd64/link -o $WORK/b001/exe/a.out -importcfg $WORK/b001/importcfg.link -buildmode=exe -buildid=58H7_5kYPqii0uAGLa38/sqTE1WDirs7yz2m_Gl_O/J64dQnzG1o1jnUgl1kbu/58H7_5kYPqii0uAGLa38 -extld=x86_64-pc-linux-gnu-gcc $WORK/b001/_pkg_.a
/usr/lib/go/pkg/tool/linux_amd64/buildid -w $WORK/b001/exe/a.out # internal
mv $WORK/b001/exe/a.out a

The missing step is the call to buildid. Guess what this does? More normal boring file I/O:

        f, err = os.OpenFile(file, os.O_WRONLY, 0)
        if err != nil {
                log.Fatal(err)
        }
        if err := buildid.Rewrite(f, matches, newID); err != nil {
                log.Fatal(err)
        }
        if err := f.Close(); err != nil {
                log.Fatal(err)
        }

So that means that either buildid is writing things back inconsistently, or that the mv that is actually a filecopy is writing things back inconsistently. Either way, getting an inconsistent executable is to be expected. With link using mapped I/O, the rest of the toolchain that runs after and doesn't use mapped I/O but rather uses normal boring file I/O is bound to create bogus files some percentage of the time, because its reads won't be consistent with link's writes.

I remain convinced that CL 235639 is correct and necessary if we take heed of MSDN's warnings.

zx2c4 commented 4 years ago

The one thing that's still bothering me is, if this is where the crash is, why don't we see a stack trace?

Take a normal working .exe, and then write zeros to various swaths of the file. Try to execute it. I would bet that there's at least a few places where writing pages of zeros causes the resultant executable to crash.

Is this true even when the file has been unmapped and reopened with a different handle? In UNIX, if you're going to interleave mmap'ed IO and read/write IO, you're advised to msync between them, but once you've closed, unmapped, and reopened the file, it's safe to do read/write IO. (msync is also a giant hammer that you don't want to overuse!)

The option 3 theory says: Mapping, unmapping, mapping will give a consistent file when reading from the second mapping. But mapping, unmapping, ReadFile-ing will not give a consistent file when using the ReadFile. MSDN makes this latter point very clear.

The option 1 theory says: Mapping, unmapping, mapping will not give a consistent file when reading from the second mapping, because the dirty bits from those pages are just dropped entirely when unmapping, and there's no nice pagecache and kswapd keeping things consistent like on Linux.

At first, option 1 seemed a tad bit more likely to me, because that would help explain the ReadFile quirk in a deeper way, as well as explain why we're getting these crashes: that when the PE loader maps the executable file, it's after those pages have been dropped entirely from before, and so its missing previously written but not flushed chunks.

But now, seeing how many times Go object files are copied around (see this latest https://github.com/golang/go/issues/38440#issuecomment-636526053 ), it seems like option 3 would be the most likely. Option 1 would be annoying and surprising behavior, after all, and option 3 is consistent with what's been expressed clearly on MSDN. And now we have Go code that precisely matches the bad edge case that MSDN says to avoid.

cherrymui commented 4 years ago

To be clear, I agree that CL 235639 is probably what we should do.

I also think that it is not safe to mix mmap and file I/O in the same process. That is why in the past, when the linker itself used both mmap and file I/O for the output binary, we did have a msync (FlushViewOfFile on windows, if mapping were used, but it was not due to permission bug) between the two. In this cycle, as we changed to use mmap only, the msync was removed (CL https://golang.org/cl/228385).

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment https://github.com/golang/go/issues/38440#issuecomment-636481353 seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

(Anything could happen after the linker process terminated. Besides what go command does, the user may open the file, execute it, or read it for inspection. We need to make sure that is safe.)

cherrymui commented 4 years ago

I'm also not really sure what go build does is really the cause. I looked at some of the failures, and they failed at go run or go test, which don't seem to copy the output file or write buildid, but just execute the binary. So it seems only executing the output after the linker process terminated is still not safe.

ianlancetaylor commented 4 years ago

@aclements If I'm reading you right, you're looking for a crash dump from cmd/link or cmd/go. But from this discussion it seems more likely to me that the executable is corrupt on disk, so running it crashes early, and the Go signal handling is never installed.

aclements commented 4 years ago

@ianlancetaylor , I'm looking for a crash dump from whatever's crashing. If it were cmd/link or cmd/go, I would expect a Go traceback, and since we don't see that, I suspect it's not cmd/link or cmd/go. If it's the corrupt binary, I would expect the crash dump to implicate some binary loader or something very early in process start up.

zx2c4 commented 4 years ago

But I'm not clear whether it is safe if another process, after waiting for the linker process terminated (which unmaps the file and closes the FD before exiting), do anything with the file, read or execute. @aclements 's comment #38440 (comment) seems to indicate this is safe on UNIX systems. The MSDN document doesn't make this explicit. But it is a possibility.

We know that another process using normal boring file I/O isn't safe. But what's unclear is whether remapping after dropping those pages is coherent. That's my option (1) earlier. This seems definitely possible, though it'd be a bit wild. The issue is essentially whether dirty pages are stored with their dirty bit in tact in some global page cache. Linux does this. The question is: does Windows?

jstarks commented 4 years ago

I'm not certain about the coherence question (based on the kernel data structures at play, I believe map+unmap+map (even with intervening close+open), will be coherent, but I can't speak from a position of authority).

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

aclements commented 4 years ago

But to @aclements's point, there are several code paths in the Windows loader that handle an access violation (or other exception) and call NtTerminateProcess without raising again. So it's certainly possible that a corrupt binary will exit without producing a crash dump. Unfortunately.

Ah, that's interesting.

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

If it were a problem in the linker itself, we would expect a Go traceback (and possibly user-mode crash dumps). If the main contents of the binary were corrupted (e.g., incoherence led to zeroed pages being mapped in), we would expect a much wider variety of crashes.

jstarks commented 4 years ago

It seems like the most likely theory at this point is that the Windows loader uses read/write IO for at least some things (even if it memory maps the segments themselves), and that omitting the FlushViewOfFile causes that to be incoherent, which causes the loader to crash.

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

aclements commented 4 years ago

That would be surprising to me. I would expect that CreateProcess would construct an image section for the file, which I would expect to implicitly flush any remaining data section (and fail if there were any existing writable mappings). I'm not an expert in this area, and I did not read the code, so this is just my intuition.

I'm not sure I understand what you're saying. What do you mean by "image section"? I wasn't saying there are any writable mappings at that point.

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

However, this is not a problem. From @zx2c4's analysis, it sounded like the corruption was introduced earlier in the chain. If the Go linker is calling CopyFile (which calls NtReadFile) without calling FlushViewOfFile, then the copied binary could be corrupted due to incoherence well before CreateProcess is called on it.

We've seen failures in situations where the binary does not get copied.

cherrymui commented 4 years ago

My theory is that CreateProcess (or the loader behind it) uses ReadFile to get a few things like headers before creating memory mappings for the bulk of the binary. Those memory mappings are probably fine because they're coherent with the data written via memory mappings in the linker. But ReadFile could be returning bad data since the linker didn't FlushViewOfFile, and that could cause loading the binary to fail.

This is my guess as well. It probably needs to read the header to know the addresses and ranges of the mappings. I guess it is the header that gets corrupted, so it fails very early so that it doesn't print much information.

zx2c4 commented 4 years ago

I'm not seeing anywhere in ntoskrnl where the process image is being read through means other than its mapping. It looks like NtCreateUserProcess calls createfile and then maps it pretty soon after. I didn't see much in userspace CreateProcessInternalW either, besides reading file metadata. But @jstarks can probably consult the source and find something more definitive than me poking around for 5 minutes in the disassembler.

zx2c4 commented 4 years ago

Actually, NtCreateUserProcess->MmCreateSpecialImageSection->MiCreateSection->MiCreateImageOrDataSection->MiCreateNewSection->MiCreateImageFileMap->MiReadImageHeaders->MiPageRead->IoPageReadEx->an irp gets queued up. So that might give credence to what @cherrymui just suggested.

jstarks commented 4 years ago

It looks like MiCreateImageFileMap flushes the data section (i.e. waits for modified pages to be written to the backing file) before doing any reads. So if the MiReadImageHeaders read is the first one in the CreateProcess path, then I wouldn't expect any coherence issues in that path.

Let me see if I can get someone from the Mm team to chime in.

zx2c4 commented 4 years ago

Let me see if I can get someone from the Mm team to chime in.

Would be very interested. I'd also like to know if the above option (1) is a real possibility. Specifically, will Windows just discard unflushed pages and ignore the dirty bit if you unmap without flushing first? I'd expect the answer to be "no, that would be insane", but ya never know. The whole caveat of fscache/normal I/O being somehow separate from mmap'd I/O is sufficiently weird that anything sort of seems possible there.

jstarks commented 4 years ago

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached, i.e. without FILE_NO_INTERMEDIATE_BUFFERING. The fact that they appear not to be in this case may be an OS bug.

We (well, not me, but others here in Windows) will try to get a local repro under stress. In the meantime, it sounds like flushing the mapping explicitly is a reasonable workaround.

Thanks for looping me in. @aclements, on which OS build are you reproducing this?

zx2c4 commented 4 years ago

Interestingly, according to the Mm team, mappings and ReadFile should be coherent as long as the file was opened cached

Oh, good, that's reassuring. I was worried that these being not coherent implied some very odd IO cache architecture.

alexbrainman commented 4 years ago

I tried simulating corrupted file on Windows

https://play.golang.org/p/5aZQqi7pS8o

and that is what I see

C:\>test
--- FAIL: TestCorruptedExecutable (0.33s)
    a_test.go:76: Child: exit status 3221225477
        ""
FAIL

C:\>

Exactly what we see on the builder.

Child process has no output (stack trace or otherwise), because it has not run yet.

And parent process prints child's exit code of 0xC0000005, and empty output.

Alex

aclements commented 4 years ago

@aclements, on which OS build are you reproducing this?

This is on the windows-amd64-longtest builder, which reports as:

$ VM=$(gomote create windows-amd64-longtest); gomote run -system $VM systeminfo
Host Name:                 SERVER-2016-V7-
OS Name:                   Microsoft Windows Server 2016 Datacenter
OS Version:                10.0.14393 N/A Build 14393
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Standalone Server
OS Build Type:             Multiprocessor Free
Registered Owner:          N/A
Registered Organization:   N/A
Product ID:                -
Original Install Date:     7/2/2018, 5:50:15 PM
System Boot Time:          6/2/2020, 1:10:45 PM
System Manufacturer:       Google
System Model:              Google Compute Engine
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 63 Stepping 0 GenuineIntel ~2300 Mhz
BIOS Version:              Google Google, 1/1/2011
Windows Directory:         C:\Windows
System Directory:          C:\Windows\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              en-us;English (United States)
Time Zone:                 (UTC+00:00) Monrovia, Reykjavik
Total Physical Memory:     14,746 MB
Available Physical Memory: 13,750 MB
Virtual Memory: Max Size:  15,770 MB
Virtual Memory: Available: 14,800 MB
Virtual Memory: In Use:    970 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    WORKGROUP
Logon Server:              \\SERVER-2016-V7-
Hotfix(s):                 5 Hotfix(s) Installed.
                           [01]: KB3186568
                           [02]: KB3199986
                           [03]: KB4049065
                           [04]: KB4132216
                           [05]: KB4284880
Network Card(s):           1 NIC(s) Installed.
                           [01]: Red Hat VirtIO Ethernet Adapter
                                 Connection Name: Ethernet
                                 DHCP Enabled:    Yes
                                 DHCP Server:     169.254.169.254
                                 IP address(es)
                                 [01]: 10.240.0.19
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

I tried simulating corrupted file on Windows

@alexbrainman, fascinating.

Y'know, if the text is zeroes, it disassembles to add %al,(%rax), which will indeed cause an access violation for most values of rax.

But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right? Or maybe this is the OS bug @jstarks is sensing. And why would be text be incoherent but not the executable headers (which we write out after the text)? So many questions...

alexbrainman commented 4 years ago

Y'know, if the text is zeroes, it disassembles to add %al,(%rax), which will indeed cause an access violation for most values of rax.

But surely the text section is also being memory mapped by the loader, so it wouldn't be incoherent, right?

I made my example up. I don't know, if it is only .text section is corrupted. And I don't know, if corrupted area is filled with 0.

Alex

zx2c4 commented 4 years ago

The "where" part is as yet undetermined, but the "what" part is almost certainly zeros, as that's what the file contains when you truncate a zero sized file to a non-zero sized file.

alexbrainman commented 4 years ago

but the "what" part is almost certainly zeros, as that's what the file contains when you truncate a zero sized file to a non-zero sized file.

Thanks for explaining. I suppose I hit jackpot when I decided to put 0s there.

Alex