git branch name encoding UTF-8

kkato233 commented 4 years ago

git branch name 顧客

fix probrem after this

PhilipOakley commented 4 years ago

it looks reasonable. tcl does have some problems with wanting to auto convert various language - charset settings (I had an issue a while back)

IIUC you have also sent a fix for gitk upstream msgID <TY2PR01MB24271C32E2FD9FD8C27CA8C2CA5D0@TY2PR01MB2427.jpnprd01.prod.outlook.com> I think it's the one that's in the screenshots ;-)

kkato233 commented 4 years ago

I sent two fix for gitk and git gui. both utf8 encoding probrem.

https://github.com/kkato233/gitk/commit/89f89a0990 https://github.com/kkato233/git-gui/commit/b96536dd9

prati0100 commented 4 years ago

Hi @kkato233

git-gui contributions are reviewed on the Git mailing list (git@vger.kernel.org), same as gitk (but you should still base the patch on this repo, and not on git/git). Check the README for more info.

Since I'm not familiar with the language for which this bug report exists (I'm guessing its Japanese), I can't really understand what exactly is wrong. And so, I can't effectively judge if the patch is doing the right thing or not. So, a better commit message subject and description would be nice.

That said, that patch looks good. And @PhilipOakley's review re-assures me :). Thanks for the contribution.

logiclrd commented 4 years ago

Ah yes, I worked in Japan for a year a long while back, and remember all the fun around encodings. :-)

Tcl's documentation includes the comment:

By default, Tcl uses the system encoding when reading from and writing to channels

On Japanese Windows installations, the system encoding is Shift-JIS, which is totally different from UTF-8. I can't actually find documentation of a prescribed encoding for the .git/HEAD file. As far as I can tell, though, from a quick (and entirely non-comprehensive) look over the code, it looks like Git itself treats all refs including HEAD as ASCII encoding, reading them into a char *, splitting off prefixes like refs/ and then grabbing everything up to '\0' (or EOF). Maybe I completely missed something, but it seems to me that Git isn't working with code points, it is just working with bytes, and if layers on top of Git are interpreting those bytes as UTF-8, that may be useful to people but there's nothing enforcing that the underlying data is UTF-8. Putting UTF-8 data into them happens to work because UTF-8 is backward-compatible with null-terminated ASCII strings.

I just tested this locally. I looked up a list of byte sequences that are not valid UTF-8, arbitrarily picked \xC3\x28 (described as "Invalid 2 Octet Sequence" here), and was able to create a branch by that name:

logiclrd@RESONATOR:/mnt/b/code/git$ git branch `printf "\xC3\x28"`
logiclrd@RESONATOR:/mnt/b/code/git$ ls -l .git/refs/heads/
total 0
drwxrwxrwx 1 logiclrd logiclrd 512 Oct 29 22:52  git-gui
-rwxrwxrwx 1 logiclrd logiclrd  41 Nov 30 20:13  git-gui-revert-untracked
-rwxrwxrwx 1 logiclrd logiclrd  41 Oct 29 22:51  git-gui-revert-untracked-main-git-line
-rwxrwxrwx 1 logiclrd logiclrd  41 Oct 24 16:27  master
-rwxrwxrwx 1 logiclrd logiclrd  41 Dec  6 01:48 '�('
logiclrd@RESONATOR:/mnt/b/code/git$ git log --pretty=oneline -n 5 `printf "\xC3\x28"`
566a1439f6f56c2171b8853ddbca0ad3f5098770 (HEAD -> master, tag: v2.24.0-rc1, origin/master, origin/HEAD, �() Git 2.24-rc1
04b1f4f768bc36cc074a311b9c5e339a9fce160a Merge branch 'sg/ci-osx-gcc8-fix'
4d6fb2beeb80f4899b0267d91c983c91772438f0 Merge branch 'ds/feature-macros'
5f0b6ed90739f9a62685360be4a30fe0b80ce06a Merge branch 'js/azure-ci-osx-fix'
c555caab7a303109d6c712d757bc4621a3ee0bbd Merge branch 'bw/format-patch-o-create-leading-dirs'
logiclrd@RESONATOR:/mnt/b/code/git$ git checkout `printf "\xC3\x28"`
Switched to branch '('
logiclrd@RESONATOR:/mnt/b/code/git$ less .git/HEAD
ref: refs/heads/<C3>(
logiclrd@RESONATOR:/mnt/b/code/git$ git status
On branch (
nothing to commit, working tree clean
logiclrd@RESONATOR:/mnt/b/code/git$ git log --pretty=oneline -n 5
566a1439f6f56c2171b8853ddbca0ad3f5098770 (HEAD, tag: v2.24.0-rc1, origin/master, origin/HEAD, �(, master) Git 2.24-rc1
04b1f4f768bc36cc074a311b9c5e339a9fce160a Merge branch 'sg/ci-osx-gcc8-fix'
4d6fb2beeb80f4899b0267d91c983c91772438f0 Merge branch 'ds/feature-macros'
5f0b6ed90739f9a62685360be4a30fe0b80ce06a Merge branch 'js/azure-ci-osx-fix'
c555caab7a303109d6c712d757bc4621a3ee0bbd Merge branch 'bw/format-patch-o-create-leading-dirs'
logiclrd@RESONATOR:/mnt/b/code/git$

Without this change, there is still a bug here because "the system encoding" could very easily not be compatible with ASCII, but with this change, what will happen when Tcl tries to interpret the data in HEAD as UTF-8?

Well, I tried to test exactly that, and found that apparently Tcl just loaded the invalid UTF-8 sequence as raw bytes. Not sure if -encoding utf-8 is actually doing anything, in that case??

logiclrd@RESONATOR:/mnt/b/code/git/.git$ tclsh
% set x [open HEAD]
file3
% fconfigure $x -translation binary -encoding utf-8
% gets $x ref
18
% puts $ref
ref: refs/heads/Ã(
%

Interestingly, while I can do all of this in Ubuntu using Windows Services for Linux, the Windows build of Git has serious problems with the repository in this state. (The filesystem is shared, so /mnt/b/code/git is actually the same thing as B:\code\git for Windows apps.)

git status reports that all files in the repo are new files scheduled for addition. Git Gui thinks the same thing:

If I try to browse the branch's files:

And Gitk says:

I can't seem to get Git Gui running on the Linux side to talk to an Xming X server at the moment, it just hangs when starting, so I can't confirm what Git Gui does when it's running in the environment where Git set up this branch name and it works. But, my guess is that you probably can't make arbitrary byte sequence branch names on Windows, something is applying a character encoding somewhere along the way, but as far as I can tell you can on Linux, and so it bears consideration what Git Gui and Gitk are going to do when running against such a configuration.

It is of course an option to simply say, "Yes, you can make Git branch names that are invalid UTF-8 sequences, but we choose not to support that in Git Gui." :-P

logiclrd commented 4 years ago

Okay now I feel silly, the Git Gui window did open from the Linux side. It just opened on my other monitor.

Here are the same Git Gui actions when running through Linux with the wonky branch name as HEAD:

(this looks right)

(this looks wrong)

(still doesn't like it)

(look in the title bar -- looks like somewhere along the way, the string got forced into UTF-8)

I just picked a random file and put a change into it to confirm that the main Git Gui view is working:

And if I put the change from this PR into the Git Gui code running from Linux, it still works properly, confirming my previous finding that Tcl isn't strict about UTF-8 encoding, and will simply copy bytes over that can't be decoded as a code point.

logiclrd commented 4 years ago

I also found something like documentation of what is valid for branch names:

A branch name can not:
    - Have a path component that begins with "."
    - Have a double dot ".."
    - Have an ASCII control character, "~", "^", ":" or SP, anywhere
    - End with a "/"
    - End with ".lock"
    - Contain a "\" (backslash

https://public-inbox.org/git/1277146419-19507-1-git-send-email-avarab@gmail.com/

Technically speaking, it doesn't say that "\xC3\x28" is invalid. ¯\_(ツ)_/¯

PhilipOakley commented 4 years ago

From a deeper historical and implementation perspective it gets complicated.

First, in Linux, the branch (ref) name (a nul terminated string) must always be a valid Linux path (backward compatibility implementation detail), hence "/" is both special (linux dire separator) and 'just a char byte'.

Git (and I think Linix) got into utf-8 as its extended default character set, because if 100% fitted with 7bit ascii, but is still backward compatible with the "it's a string && its a path".

However tcl goes it's own way regarding auto-conversion of strings. in https://www.tutorialspoint.com/tcl-tk/tcl_strings.htm it says "Tcl uses 16 bit unicode characters and alphanumeric characters can contain letters including non-Latin characters, number or punctuation." but I don't know how much to trust that. Mainly the man pages fail to say what encoding is being used.

Git also acquired the ref restrictions described above. I expected that git help revisions would include those rule but it looks like a doc patch is needed... (note to self).

A lot of fun.

logiclrd commented 4 years ago

Would there be merit in teaching Git itself to reject reference names that aren't valid UTF-8? Is that likely to cause problems for anybody anywhere? Compared to the current approach where, at least on non-Windows, it looks like no encoding is enforced at all.

If Git were taught to refuse to create non-UTF-8 refs, but continued to process all refs the way it does now, then anything anyone has already created that they're using would continue working, they just couldn't create new ones that aren't UTF-8.

logiclrd commented 4 years ago

I guess environments where the operating system is currently providing a layer of translation that wasn't explicitly asked for (i.e., Windows, per my testing above) would pose a potential issue during transition, w.r.t. international branch names already in use. Though, I haven't checked that what Windows is doing isn't actually already trying to parse the branch name as UTF-8, so maybe that's a non-issue. :-P

kkato233 commented 4 years ago

for example git client branch name internationalization status . × git Cmd 〇 git bash 〇 GitHub Desktop 〇 Visual Studio Code 〇 GitHub web site

https://github.com/kkato233/git-encoding-branch/tree/漢字

logiclrd commented 4 years ago

Looks like there's already substantial precedent for things assuming it will be UTF-8 even though Git itself doesn't seem to enforce that, in that case.

prati0100 commented 4 years ago

IIUC, supporting UTF-8 in git-gui won't hurt at the very least. Yes, it does not strictly follow Git's encoding (ASCII, as your findings suggest), but with UTF-8 being backward-compatible with ASCII, there won't be any downsides -- only upsides. We support people who aren't exactly following the "spec" while leaving those who are alone. Sounds like a win-win to me. Also, like @kkato233 pointed out, most Git frontends do support UTF-8.

And people using invalid UTF-8 in their ref names are just asking for trouble. Even in the case of people using encodings incompatible with ASCII, I don't think it is our problem. Yes, git-gui will show something weird, but so will everything else.

Would there be merit in teaching Git itself to reject reference names that aren't valid UTF-8?

I don't think so. I see no harm in letting refnames be in other languages. But I won't claim to completely understand all the design decisions in Git. People like Junio, Peff, et al. should have a much better idea than me. So, these are just my 2 cents. Ask on the list for a definitive answer :)

From all this discussion, I gather that the patch is all OK (apart from the commit message). I'm not missing anything, right?

prati0100 commented 4 years ago

However tcl goes it's own way regarding auto-conversion of strings. in https://www.tutorialspoint.com/tcl-tk/tcl_strings.htm it says "Tcl uses 16 bit unicode characters and alphanumeric characters can contain letters including non-Latin characters, number or punctuation." but I don't know how much to trust that. Mainly the man pages fail to say what encoding is being used.

Looking at [0], it seems your source is indeed wrong. To quote from the page:

Beginning in Tcl 8.1, Tcl represents all strings internally as Unicode characters in UTF-8 format.

Also:

Tcl automatically converts strings from UTF-8 format to the system encoding and vice versa whenever it communicates with the operating system. Tcl usually can determine a reasonable default system encoding based on these settings, but if for some reason it cannot, it uses ISO 8859-1 as the default system encoding. The default encoding for newly opened channels (both files and sockets) is the same as the platform- and locale-dependent system encoding used for interfacing with the operating system

(The quote is multiple lines at different locations stitched together, not a single paragraph in the source text)

So, I guess Tcl used the default encoding for Japanese Windows systems, which I guess in Shift-JIS as the encoding for the channel. With Shift-JIS being incompatible with UTF-8, it showed the wrong thing.

[0] https://www.tcl.tk/doc/howto/i18n.html

logiclrd commented 4 years ago

IIUC, supporting UTF-8 in git-gui won't hurt at the very least. Yes, it does not strictly follow Git's encoding (ASCII, as your findings suggest), but with UTF-8 being backward-compatible with ASCII, there won't be any downsides -- only upsides. We support people who aren't exactly following the "spec" while leaving those who are alone.

Technically, those who are exactly following the spec might have code sequences that aren't valid UTF-8. Not to suggest that they might be intentionally loading it with broken sequences, but e.g. to follow the theme of this thread, if they have their console set to Shift-JIS encoding, then I would expect git on the command-line to display all the characters properly with Shift-JIS in the ref names. Git doesn't care what the characters are, just that they are properly null-terminated, and if it copies Shift-JIS data from the ref names to the console and the terminal is properly configured for it, then it'll display properly.

Making Git Gui do UTF-8 exclusively means that those people (I don't know how many of them exist, if any) literally cannot use Git Gui for their workflow.

If Git itself were updated so that ref names have to be valid UTF-8, then that would have the potential downside of forcing people to make their terminals interpret UTF-8 if they want extended characters/code points, but many people are probably already using UTF-8 terminals, and it carries the upside of making the UTF-8 assumption so many tools are already making contractual.

That's just my take on it, I'm not terribly invested either way because my workflows don't typically involve anything out of the 7-bit ASCII plane :-)

kkato233 commented 4 years ago

create new pull request https://github.com/prati0100/git-gui/pull/22

PhilipOakley commented 4 years ago

On 06/12/2019 08:35, Jonathan Gilbert wrote:

I also found something like documentation of what is valid for branch names:
A branch name can not: - Have a path component that begins with
"." - Have a double dot ".." - Have an ASCII control character,
"~", "^", ":" or SP, anywhere - End with a "/" - End with ".lock"
- Contain a "\" (backslash
https://public-inbox.org/git/1277146419-19507-1-git-send-email-avarab@gmail.com/

Technically speaking, it doesn't say that "\xC3\x28" is invalid. ¯_(ツ)_/¯

Following up that old thread (21 Jun 2010), I see it petered out.

It feels like this should actually be positively included in the documentation, maybe in the core text of git help revisions, which one would expect to have this clear summary of what a head ref name is allowed to include.

-- Philip

PhilipOakley commented 4 years ago

On 07/12/2019 10:00, Philip Oakley wrote:

On 06/12/2019 08:35, Jonathan Gilbert wrote:

I also found something like documentation of what is valid for branch names:

A branch name can not: - Have a path component that begins with "." - Have a double dot ".." - Have an ASCII control character, "~", "^", ":" or SP, anywhere - End with a "/" - End with ".lock"

Contain a "\" (backslash

https://public-inbox.org/git/1277146419-19507-1-git-send-email-avarab@gmail.com/

Technically speaking, it doesn't say that "\xC3\x28" is invalid. ¯_(ツ)_/¯

Following up that old thread (21 Jun 2010), I see it petered out.

It feels like this should actually be positively included in the documentation, maybe in the core text of git help revisions, which one would expect to have this clear summary of what a head ref name is allowed to include.

In git help branch, under the <branchname> entry, it links to git help git-check-ref-format which has a much fuller list, but that should also be referenced from the git help revisions guide..

The difference between a rev(ision) and a ref(eference) is a bit of a sliding scale.. [1]

[1] https://stackoverflow.com/questions/11792538/in-git-what-is-the-difference-between-a-commits-and-a-revisions

Philip

prati0100 commented 4 years ago

Technically, those who are exactly following the spec might have code sequences that aren't valid UTF-8. Not to suggest that they might be intentionally loading it with broken sequences, but e.g. to follow the theme of this thread, if they have their console set to Shift-JIS encoding, then I would expect git on the command-line to display all the characters properly with Shift-JIS in the ref names. Git doesn't care what the characters are, just that they are properly null-terminated, and if it copies Shift-JIS data from the ref names to the console and the terminal is properly configured for it, then it'll display properly.

From man gitrevisions:

While the ref name encoding is unspecified, UTF-8 is preferred as some output processing may assume ref names in UTF-8.

It was introduced by 1452bd64f14.

So Git does recommend UTF-8, but does not restrict other encodings.

If we do not take this patch, Tcl will default the encoding to the system default. So, people using something like Shift-JIS as their system encoding but UTF-8 as their refname encoding [0] will not be able to use git-gui properly.

In the end, I think the best compromise would be to default to UTF-8 but allow changing the encoding via a config variable. Thoughts?

[0] The thing that puzzles me is since references are just files under the .git directory, shouldn't they always be in the system encoding? If the system encoding is Shift-JIS, can a user still save filenames encoded in UTF-8? But apparently you can, hence this pull request.

logiclrd commented 4 years ago

The thing that puzzles me is since references are just files under the .git directory, shouldn't they always be in the system encoding? If the system encoding is Shift-JIS, can a user still save filenames encoded in UTF-8? But apparently you can, hence this pull request.

I'm pretty sure what this means is that the references themselves don't intrinsically have an encoding. Their data is code points that the file system stores somehow, and when you use the OS API to query the filesystem, it puts it into an encoding that is (hopefully) a superset of what the file system supports, whatever that is, or at least a superset of the characters you are using. E.g., on Windows the API returns UTF-16 strings (unless you use the A version of the API functions, but who would do that? :-P).

But then, when you have another file that refers to a ref, such as .git/HEAD, the contents of that file are just a byte sequence as far as the OS is concerned. Now it's up to Git how to encode it. And that's where most of Git presently says "anything goes, as long as I can put it in a null-terminated char *". (Git independently says that refs can't contain a certain set of characters, including control characters and such, but that is enforced where the refs actually exist, and the fact that things pointing at the refs can't contain those characters is a consequence of that, not its own rule.) As you point out, though, Git recommends that UTF-8 be used for that. The question, though, is where does Git get those char * sequences in the first place?

To sum it up, the actual refs are filenames, and the encoding of filenames is an implementation detail of Git. But, when other parts of Git refer to these refs, Git has to decide how it is going to encode those filenames, which might have extended characters in them. As far as I can tell, the part of Git that writes out files like .git/HEAD simply doesn't take encoding into account, and writes the raw bytes that are in the variable without translation.

So where do those char *s come from? Well, they come over the network from other systems, they come from the filesystem, and they come from the command-line. (Any others I'm not thinking of?)

As for the filesystem, based on the differing results on Windows vs. Linux in my first round of testing, I was leaning toward the suspicion that Git on Windows used the system encoding exclusively. Someone with a Japanese Windows installation would be getting Shift-JIS values back from things like readdir. I just did some further testing, though, and I've found that even on Windows, Git seems to be receiving filenames from things like readdir in UTF-8. I now attribute my inconsistent results to the fact that my test was specifically invalid UTF-8, and it's quite conceivable that that fails to round-trip on Windows.

I used a Unicode console to PowerShell (which processes everything internally using Windows' standard UTF-16 strings) and created a branch named "漢字" (the Japanese word "kanji"), and the Windows build processed this just fine.

^-- When Git received its command-line from the parent shell (PowerShell), it will almost certainly have been a UTF-16 string (I don't think PowerShell ever uses 8-bit characters for anything internally). In conversion from that to the argv that Git received from the C runtime in main, it seems to have been translated to UTF-8. All of the strings Git passed around internally for the ref name, then, were char * values that happened to be UTF-8 encoded. The same string (presumably) was passed to the C runtime file I/O functions to create the file in .git/refs/heads, and the UTF-8 was converted by the C runtime without any loss of fidelity for use by the underlying OS file system API.

When Git wrote its console output back out, though, it wrote UTF-8 but the console interpreted it as ISO-8859-1. (Also PowerShell really doesn't like the stderr stream. :-P That's why the Switched to branch text got flagged as an "error".)

The repository is in a perfectly-consistent state, and I can do operations from non-Unicode terminals (as long as I don't have to type the branch name :-) )

^-- The Windows console does not grok UTF-8.

^-- I'm pretty sure the pair of "unknown" glyphs mean that the Ubuntu tty is converting the UTF-8, I'm just not using a font with the necessary code points. :-)

So, when Git gets ref names from the filesystem, it is in fact getting UTF-8 consistently. When Git gets ref names from other systems over the network, it's subject to where those systems got the ref names from.

Based on this, it seems that the main issue is that Git command-line allows you to supply strings that are not valid UTF-8.

I am about 85% confident in this analysis. There is definitely room for error. :-)

PhilipOakley commented 4 years ago

I am about 85% confident in this analysis. There is definitely room for error. :-)

I think I'd agree, but there are potential phraseology errors. The potential is that the branch names need to complete a full cycle of coding/recoding, which... ... leaves open the potential for misunderstandings about start point and directions of the coding/recoding (I deliberately avoiding the 'encoding' phrase ;-)

I start with the git ref being a simple nul terminated byte string held in memory (equivalently, within a file's byte stream). This is them processed by git, which presumes unless told otherwise, that the string is a Linux path for the purpose of saving the ref's content in a file 'of the same name'. We don't care about the value of that ref content in this discussion (yet?)

When git wants to display the ref's name (the string) it then calls it's displaying UI, and usually expects that the UI will display it as if it is utf-8. In some cases can pass forward some encoding type to the UI - but why (would it need that).. step back two paces.

The ref's content should have been stored in a file 'of the same name', but the OS (if it's not Linux, e.g. tcl or Windows, or both) could screw that up such that the OS displays the name in different glyphs (of the putative chars) to that of the utf8 expectation. This can be made worse if the OS allows different char representations. Worse, the linux path separator may be different adding to the confusion (hence some of the name restrictions git applies)...

To use the host OS's path representation of the ref is probably the wrong place to start, especially when it may not have a canonical representation.

So the UI needs to convert user entered ref names into a utf8 string, and have a bidirectional transfer of that string value to the FS, and the same in reverse for the displaying the ref name from git back to the user.

So, when Git gets ref names from the filesystem, it is in fact getting UTF-8 consistently. When Git gets ref names from other systems over the network, it's subject to where those systems got the ref names from.

Based on this, it seems that the main issue is that Git command-line allows you to supply strings that are not valid UTF-8.

so, not withstanding phraseology [the cli is a UI], I agree

logiclrd commented 4 years ago

I mean, it seems to me the real question is, can the string in that char * be round-tripped through the filesystem as an object name? If it can't, then it isn't valid, because Git can't operate properly. Couple that with the fact that, experimentally, it seems that the interface to the filesystem on all platforms is using UTF-8 encoding (regardless of whether the underlying storage is physically), I think that in practice UTF-8 is the only valid encoding for the ref names.

If UTF-8 is the only valid encoding for the ref names, then supplying a string on the command-line that is not valid UTF-8 should be rejected with no further processing.

prati0100 commented 4 years ago

FWIW, I asked Junio this question. Here's his response:

Quite honestly, as long as we can keep the core part to stay "pathnames and refnames are just runs of bytes, terminated by NUL, and no other interpretation is necessary", I do not think I care too deeply if the UI layers (either GUI or Porcelain) become a bit more restrictive.

But that is just my gut feeling.

PhilipOakley commented 4 years ago

then supplying a string on the command-line that is not valid UTF-8 ... (~logiclrd)

I think we are missing the fact that there can be a TUI (terminal user interface) or GUI between the user's keyboard, and the string that git sees, and that these intermediate interface layers (some of which might be null) could well have (re-)encodings applied to them.

If the user types or pastes the string, then it's an interface UI. Trying to nail down (especially for internatalionalised users) a non re-encoded method is hard. Hence all the problems.

logiclrd commented 4 years ago

But right now, it's not documented at all. If the documentation clearly says, "If the command-line is supplied as an 8-bit encoding, it must be encoded using UTF-8", then those tools that aren't doing it today know what they need to change, and a new Git version that rejects what they're currently sending provides a clear signal that they need to make changes.

prati0100 / git-gui

git branch name encoding UTF-8 #21

[1] https://stackoverflow.com/questions/11792538/in-git-what-is-the-difference-between-a-commits-and-a-revisions