git-for-windows / git

A fork of Git containing Windows-specific patches.
http://gitforwindows.org/
Other
8.31k stars 2.53k forks source link

Git 2.28.0: The content of diff containing CJK strings cannot be displayed correctly in conhost #2802

Closed fcharlie closed 4 years ago

fcharlie commented 4 years ago

Setup

$ git --version --build-options

git version 2.28.0.windows.1
cpu: x86_64
built from commit: 77982caf269b7ee713a76da2bcf260c34d3bf7a7
sizeof-long: 4
sizeof-size_t: 8
shell-path: /bin/sh
$ cmd.exe /c ver

Microsoft Windows [版本 10.0.19042.450]
# One of the following:
> type "C:\Program Files\Git\etc\install-options.txt"
> type "C:\Program Files (x86)\Git\etc\install-options.txt"
> type "%USERPROFILE%\AppData\Local\Programs\Git\etc\install-options.txt"
$ cat /etc/install-options.txt

Editor Option: VIM
Custom Editor Path:
Path Option: BashOnly
SSH Option: OpenSSH
Tortoise Option: false
CURL Option: OpenSSL
CRLF Option: CRLFCommitAsIs
Bash Terminal Option: MinTTY
Git Pull Behavior Option: Merge
Use Credential Manager: Core
Performance Tweaks FSCache: Enabled
Enable Symlinks: Enabled
Enable Pseudo Console Support: Enabled

insert your response here

Details

Windows Terminal

git diff

The CJK string is displayed correctly as in Mintty.

image

Windows Terminal snapshot:

image

Git CMD snapshot:

image

dscho commented 4 years ago

This probably has the same underlying root cause as msys2/MSYS2-packages#1974. Could you call cmd /c chcp 65001 and try again?

dscho commented 4 years ago

Oh, Windows Terminal... What is the value of LANG? And of LC_ALL and LC_CTYPE?

fcharlie commented 4 years ago

Oh, Windows Terminal... What is the value of ? And of and ?LANG``LC_ALL``LC_CTYPE

Windows Terminal(Powershell) not set LANG LC_ALL and LC_CTYPE

After LANG is set, the display is normal, which seems to be a duplicate problem.

image

dscho commented 4 years ago

Path Option: BashOnly

This suggests that git should not even be found when you try to call it in PowerShell. However, it is obviously found. Which makes me wonder, where it is found. In Bash, of course, LANG is configured in /etc/profile.d/lang.sh, which is why it works there.

I vividly remember that we tried, once upon a time, to configure LC_CTYPE=C.UTF-8 as a fall-back and that really messed up some use cases, which is why we now fall back to LC_CTYPE=C: https://github.com/git-for-windows/git/blob/77982caf269b7ee713a76da2bcf260c34d3bf7a7/compat/mingw.c#L3372-L3373

Not sure whether there is anything we can do to help this situation better, not if we want to avoid similar breakages...

dscho commented 4 years ago

Just to make sure, could you re-test with the latest snapshot, please?

fcharlie commented 4 years ago

@dscho Thanks. It seems that git diff is working properly.

download: https://wingit.blob.core.windows.net/files/MinGit-prerelease-2.28.0.windows.1-51-g87c86a26de-20200923135803-64-bit.zip (missing less) image

https://wingit.blob.core.windows.net/files/PortableGit-prerelease-2.28.0.windows.1-51-g87c86a26de-20200923135803-64-bit.7z.exe Not working properly

image

dscho commented 4 years ago

Could you push that cjk repository somewhere, for easy testing?

fcharlie commented 4 years ago

Could you push that cjk repository somewhere, for easy testing?

Open Windows Terminal (PowerShell Core) :


git clone https://github.com/fcharlie/cjkdemo.git
cd cjkdemo
git diff f789f9f390c80ce6cc59b7770fe833554140d83f
dscho commented 4 years ago

Okay, I gave it a try and could verify that:

dscho commented 4 years ago

@fcharlie I think I have half a fix here: https://github.com/git-for-windows/git/pull/2834

dscho commented 4 years ago

I think I have half a fix here: #2834

@fcharlie I hope you have a good idea how to fix the other half because I sure don't.

fcharlie commented 4 years ago

@dscho I have a solution here. In git wapper, we always set LANG to **.UTF-8 according to the user's locale when LANG is not set.

dscho commented 4 years ago

I have a solution here. In git wapper, we always set LANG to **.UTF-8 according to the user's locale when LANG is not set.

Unfortunately, I do not remember the details, and failed to dig up the context, but I vividly remember that we tried that and that it wreaked havoc.

We cannot just set LANG=**.UTF-8 unless we know that the code page supports it.

Unfortunately, Windows Terminal does not set the code page to UTF-8, otherwise my fix would work without any additional work.

fcharlie commented 4 years ago

I have a solution here. In git wapper, we always set LANG to **.UTF-8 according to the user's locale when LANG is not set.

Unfortunately, I do not remember the details, and failed to dig up the context, but I vividly remember that we tried that and that it wreaked havoc.

We cannot just set unless we know that the code page supports it.LANG=**.UTF-8

Unfortunately, Windows Terminal does not set the code page to UTF-8, otherwise my fix would work without any additional work.

Maybe we should reimplement functions like fprintf in git in Windows to support git encoding completely as UTF-8, and then let all be processed in accordance with UTF-8, no longer consider the LANG setting. To be honest, I think that now to evaluate CodePage, this thing is completely dross.

File not tty: fprintf -- UTF-8 --> WriteFile File is console: fprintf -- UTF-8 to UTF-16 --> WriteConsoleW

I saw that git has a similar mechanism, but it feels a bit confusing.

void winansi_init(void)
{
    int con1, con2;
    wchar_t name[32];

    /* check if either stdout or stderr is a console output screen buffer */
    con1 = is_console(1);
    con2 = is_console(2);

    /* Also compute console bit for fd 0 even though we don't need the result here. */
    is_console(0);

    if (!con1 && !con2) {
#ifdef DETECT_MSYS_TTY
        /* check if stdin / stdout / stderr are MSYS2 pty pipes */
        detect_msys_tty(0);
        detect_msys_tty(1);
        detect_msys_tty(2);
#endif
        return;
    }

    /* create a named pipe to communicate with the console thread */
    if (swprintf(name, ARRAY_SIZE(name) - 1, L"\\\\.\\pipe\\winansi%lu",
             GetCurrentProcessId()) < 0)
        die("Could not initialize winansi pipe name");
    hwrite = CreateNamedPipeW(name, PIPE_ACCESS_OUTBOUND,
        PIPE_TYPE_BYTE | PIPE_WAIT, 1, BUFFER_SIZE, 0, 0, NULL);
    if (hwrite == INVALID_HANDLE_VALUE)
        die_lasterr("CreateNamedPipe failed");

    hread = CreateFileW(name, GENERIC_READ, 0, NULL, OPEN_EXISTING, 0, NULL);
    if (hread == INVALID_HANDLE_VALUE)
        die_lasterr("CreateFile for named pipe failed");

    /* start console spool thread on the pipe's read end */
    hthread = CreateThread(NULL, 0, console_thread, NULL, 0, NULL);
    if (hthread == INVALID_HANDLE_VALUE)
        die_lasterr("CreateThread(console_thread) failed");

    /* schedule cleanup routine */
    if (atexit(winansi_exit))
        die_errno("atexit(winansi_exit) failed");

    /* redirect stdout / stderr to the pipe */
    if (con1)
        hconsole1 = swap_osfhnd(1, duplicate_handle(hwrite));
    if (con2)
        hconsole2 = swap_osfhnd(2, duplicate_handle(hwrite));
}

https://github.com/git-for-windows/git/blob/675a4aaf3b226c0089108221b96559e0baae5de9/compat/mingw.c#L1488-L1736

fcharlie commented 4 years ago

Implementation of golang:

https://github.com/golang/go/blob/1f4d035178d2d792a74b6b872f6a213bf5fd9326/src/internal/poll/fd_windows.go#L650-L691

// Write implements io.Writer.
func (fd *FD) Write(buf []byte) (int, error) {
    if err := fd.writeLock(); err != nil {
        return 0, err
    }
    defer fd.writeUnlock()
    if fd.isFile {
        fd.l.Lock()
        defer fd.l.Unlock()
    }

    ntotal := 0
    for len(buf) > 0 {
        b := buf
        if len(b) > maxRW {
            b = b[:maxRW]
        }
        var n int
        var err error
        if fd.isFile {
            switch fd.kind {
            case kindConsole:
                n, err = fd.writeConsole(b)
            default:
                n, err = syscall.Write(fd.Sysfd, b)
                if fd.kind == kindPipe && err == syscall.ERROR_OPERATION_ABORTED {
                    // Close uses CancelIoEx to interrupt concurrent I/O for pipes.
                    // If the fd is a pipe and the Write was interrupted by CancelIoEx,
                    // we assume it is interrupted by Close.
                    err = ErrFileClosing
                }
            }
            if err != nil {
                n = 0
            }
        } else {
            if race.Enabled {
                race.ReleaseMerge(unsafe.Pointer(&ioSync))
            }
            o := &fd.wop
            o.InitBuf(b)
            n, err = execIO(o, func(o *operation) error {
                return syscall.WSASend(o.fd.Sysfd, &o.buf, 1, &o.qty, 0, &o.o, nil)
            })
        }
        ntotal += n
        if err != nil {
            return ntotal, err
        }
        buf = buf[n:]
    }
    return ntotal, nil
}

// writeConsole writes len(b) bytes to the console File.
// It returns the number of bytes written and an error, if any.
func (fd *FD) writeConsole(b []byte) (int, error) {
    n := len(b)
    runes := make([]rune, 0, 256)
    if len(fd.lastbits) > 0 {
        b = append(fd.lastbits, b...)
        fd.lastbits = nil

    }
    for len(b) >= utf8.UTFMax || utf8.FullRune(b) {
        r, l := utf8.DecodeRune(b)
        runes = append(runes, r)
        b = b[l:]
    }
    if len(b) > 0 {
        fd.lastbits = make([]byte, len(b))
        copy(fd.lastbits, b)
    }
    // syscall.WriteConsole seems to fail, if given large buffer.
    // So limit the buffer to 16000 characters. This number was
    // discovered by experimenting with syscall.WriteConsole.
    const maxWrite = 16000
    for len(runes) > 0 {
        m := len(runes)
        if m > maxWrite {
            m = maxWrite
        }
        chunk := runes[:m]
        runes = runes[m:]
        uint16s := utf16.Encode(chunk)
        for len(uint16s) > 0 {
            var written uint32
            err := syscall.WriteConsole(fd.Sysfd, &uint16s[0], uint32(len(uint16s)), &written, nil)
            if err != nil {
                return 0, err
            }
            uint16s = uint16s[written:]
        }
    }
    return n, nil
}
dscho commented 4 years ago

We already write bytes to files as-are, without re-encoding. The winansi code you found targets the Win32 Console and needs to re-encode (in particular ANSI control sequences).

That still does not answer the question how we could discern the case where Windows Terminal pretends not to handle Unicode but actually does handle Unicode.

fcharlie commented 4 years ago

@dscho Windows Terminal is no different from the old Console, except that it can handle ANSI better. Therefore, Git CMD will also have this problem. In addition, the internal behavior of git is not consistent. For example, git commit can be displayed correctly, but git log/git diff cannot

image

I use the blobview program written by golang to check the commit, but it is actually UTF-8

image

So there may be two problems here. The first is the inconsistency of git commit/git diff/git log processing, and the second is my previous reply. Git follows the old locale setting. The complexity of this setting makes it easy to appear problem.

Of course, I think this is a universal problem with a lot of historical baggage, and it is good to be able to let go of these baggage.

dscho commented 4 years ago

git commit can be displayed correctly, but git log/git diff cannot

That's because git commit does not use the pager, but git log/git diff do. And that pager is less.exe provided by the MSYS2 project, which really needs to know the output encoding (because it does not take git.exe's shortcuts that may or may not be appropriate).

fcharlie commented 4 years ago

git commit can be displayed correctly, but git log/git diff cannot

That's because does not use the pager, but / do. And that pager is provided by the MSYS2 project, which really needs to know the output encoding (because it does not take 's shortcuts that may or may not be appropriate).git commit``git log``git diff``less.exe``git.exe

So can we reimplement less?

dscho commented 4 years ago

So can we reimplement less?

Oh my. If you want to. I'll not get into this, though...

fcharlie commented 4 years ago

So can we reimplement less?

Oh my. If you want to. I'll not get into this, though...

Or patch less?

dscho commented 4 years ago

Or patch less?

Or patch the MSYS2 runtime.

But how? That is the big question.

Because you can get less.exe to work correctly, by setting LC_CTYPE (or LANG) in a specific way. As my PR demonstrates, we can do that even in git.exe. But the biggest problem remains: how to discern the situations when it is appropriate (when the code page is UTF-8) vs when it is not appropriate (IIRC there was a problem when the code page was set to something non-ASCII, non-UTF-8).

dscho commented 4 years ago

IIRC there was a problem when the code page was set to something non-ASCII, non-UTF-8

I think I found the original report: https://groups.google.com/g/git-for-windows/c/9_abrdlqi48/m/S9pJE56ZDgAJ

However, I just re-tested with ConEmu and cannot replicate that behavior. Therefore, I will go ahead and just replace the "C" by "C.UTF-8" and hope that it does not break anyone's scenario.

dscho commented 4 years ago

@fcharlie I went ahead and updated https://github.com/git-for-windows/git/pull/2834. When you get a chance, please test this:

  1. install Git for Windows' SDK,
  2. sdk cd git,
  3. git pull https://github.com/dscho/git/ lctype-should-adapt-to-console-code-page
  4. build Git via make -j$(nproc)
  5. copy git.exe into a portable Git and test
fcharlie commented 4 years ago

@fcharlie I went ahead and updated #2834. When you get a chance, please test this:

  1. install Git for Windows' SDK,
  2. sdk cd git,
  3. git pull https://github.com/dscho/git/ lctype-should-adapt-to-console-code-page
  4. build Git via make -j$(nproc)
  5. copy git.exe into a portable Git and test

Good I will try it out and give you feedback, thank you.

fcharlie commented 4 years ago

@dscho After testing, your patch is effective. Thanks.

image

dscho commented 4 years ago

Very good!