Open LinqLover opened 3 years ago
Can you confirm for me please that this is UTF-8 encoded?
What do you mean by this? This is the relevant GitHub URL: https://github.com/hpi-swa-lab/squeak-inbox-talk/tree/expos%C3%A9
I don't know how the GitHub APIs handle encodings.
(I hope I will be able to officially announce the project in a few weeks. :-))
It is indeed UTF-8. https://www.fileformat.info/info/unicode/char/00e9/index.htm
#utf8ToSqueak fixes the string.
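To illustrate the kind of repair meant here, a minimal Python sketch (Python is used only for illustration; the helper name `fix_utf8_read_as_latin1` is hypothetical and not part of Squot or Squeak). It assumes the garbling came from reading UTF-8 bytes one byte per character:

```python
def fix_utf8_read_as_latin1(garbled: str) -> str:
    """Reinterpret a string whose UTF-8 bytes were read one byte per character."""
    # Latin-1 maps code points 0-255 to the same byte values,
    # so encoding with it recovers the original raw bytes.
    raw = garbled.encode("latin-1")
    return raw.decode("utf-8")


# Simulate the problem: UTF-8 bytes of the branch name, misread byte by byte.
garbled = "exposé".encode("utf-8").decode("latin-1")
print(garbled)                           # exposÃ©
print(fix_utf8_read_as_latin1(garbled))  # exposé
```

The repair only works if the bytes really were UTF-8; for a name stored in another encoding the decode step raises an error.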
This is tricky: since Git does not care about the encoding of the ref filename, you could also have branch names in other encodings.
For example, I have both versions now:
(copy&paste template for future attempts: exposé)
So sending utf8ToSqueak to all those strings is not really correct.
However, in my case the file also looks UTF-8-garbled in Windows Explorer. What is the situation in your case? Are you on Windows or on Linux (or WSL)? Did you clone with Squit or with command-line Git?
When I pushed my Windows-é branch, I ended up with this: https://github.com/j4yk/Squot/tree/expos%E9
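The two GitHub URLs in this thread differ exactly in how the é was turned into bytes before percent-encoding. A short Python sketch using the standard library (for illustration only; this says nothing about how GitHub handles it internally):

```python
from urllib.parse import quote, unquote

name = "exposé"

# Same visible name, two different byte sequences.
print(quote(name, encoding="utf-8"))    # expos%C3%A9
print(quote(name, encoding="latin-1"))  # expos%E9

# The Latin-1 form is not valid UTF-8, so strict decoding fails.
try:
    unquote("expos%E9", encoding="utf-8", errors="strict")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```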
Sorry for the late reply. I created the repository branch using VS Code on Windows. Later I cloned the repository again directly from the Squit browser, using an image that also runs on Windows. My file structure in the .git folder created by Squot contains only the exposé version.
How does the ref filename look in the repository created by VS Code?
I guess the only thing we can do here is adhere to how the original Git displays such encoded branch names. Please show me your git branch output for your repositories with exposé. Also, please create another branch with a special character using the Git CLI and post how that file name ended up encoded under Windows and from WSL. Finally, please try to create two branches that are actually the same, but with different encodings (like my two exposé branches above) and report how they show up in git branch.
Looks like the Git CLI somehow does it correctly:
(exposé branch latin-1 encoded, testé branch utf-8-encoded by GitHub)
Jakob@JAKOBS-PC MINGW64 /c/Squeak/MA/Squot-Squit (imposé)
$ git fetch j4yk
remote: Enumerating objects: 4, done.
remote: Counting objects: 100% (4/4), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 1), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), 636 bytes | 31.00 KiB/s, done.
From https://github.com/j4yk/Squot
* [new branch] exposÚ -> j4yk/exposÚ
* [new branch] test├® -> j4yk/test├®
Jakob@JAKOBS-PC MINGW64 /c/Squeak/MA/Squot-Squit (imposé)
$ git branch -a | grep -E test\|pos
* imposé
latest-release
silence-mc-tests
remotes/j4yk/exposé
remotes/j4yk/testé
remotes/origin/latest-release
Jakob@JAKOBS-PC MINGW64 /c/Squeak/MA/Squot-Squit.git (GIT_DIR!)
$ cmd
C:\Squeak\MA\Squot-Squit.git>dir refs\remotes\j4yk
dir refs\remotes\j4yk
Directory of C:\Squeak\MA\Squot-Squit.git\refs\remotes\j4yk
30.05.2021 11:47 <DIR> .
30.05.2021 11:47 <DIR> ..
08.04.2021 22:48 41 develop
21.05.2021 22:03 41 exposé
08.04.2021 14:10 <DIR> feature
16.12.2020 03:12 <DIR> test
30.05.2021 11:47 41 testé
3 File(s) 123 bytes
4 Dir(s)
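The garbled names in the fetch output above (exposÚ and test├®) are consistent with the raw ref bytes being rendered in an OEM console code page. A small Python sketch, assuming code page 850 (an assumption; the thread does not state which code page the console used):

```python
# Raw ref name bytes as they would arrive from the remote.
latin1_ref = "exposé".encode("latin-1")  # b'expos\xe9'
utf8_ref = "testé".encode("utf-8")       # b'test\xc3\xa9'

# Assumed console code page: 850 (not confirmed in the thread).
print(latin1_ref.decode("cp850"))  # exposÚ
print(utf8_ref.decode("cp850"))    # test├®
```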
After some playing around without Squeak, I find that other tools, including the Git CLI and VS Code, always create the ref file names with the correct encoding, that is, you see the é or other special characters correctly in the file explorer. If you fetch a branch that appears correctly on GitHub, it will end up in the packed-refs file, UTF-8-encoded. But if you use git checkout --track origin/exposé, for example, Git will create a local branch of the same name by itself, and it looks correct in the file explorer.
I will not pursue further how exactly they do it, since on the one hand it is said that Git treats the names as just bytes, while on the other hand what appears in argv[] depends on the console codepage. Windows Git might also use GetCommandLineW() or check the current codepage... But GitHub does not allow searching the code of forks, and the repository for Windows Git is a fork.
NTFS always stores file names as UTF-16, so the file name is not what appears on the wire during fetch or push. So we can assume that a file name of exposé under refs/heads is in fact no longer a branch called "exposé", and is therefore corrupted.
git branch lists the branches in UTF-8, which can be seen if it is run from cmd or PowerShell (the name appears as expos<C3><A9>, for example, where the substitutes in <> actually use inverse coloring, i.e. black text on colored background). The names look just right and unobstructed in the mintty terminal employed by "Git Bash".
Git does not recognize my latin-1-encoded branch with git checkout --track origin/exposé. Probably it cannot match what it got from the command line with the ref in the packed-refs file.
PS C:\Squeak\Squot-Bugfixes\ref-encoding\Squot> git checkout --track origin/exposé
fatal: 'origin/exposé' is not a commit and a branch 'exposé' cannot be created from it
PS C:\Squeak\Squot-Bugfixes\ref-encoding\Squot> git branch -a
* f<C3><B3>obar
remotes/origin/expos<E9>
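The <C3><B3> and <E9> notation in this listing can be reproduced with a small Python sketch. It only mimics the display and is not how Git or its pager implement it; the helper name `show_raw` is hypothetical. It also shows why the lookup likely fails: the name typed on the command line, if passed as UTF-8, does not match the Latin-1 bytes stored for the ref.

```python
def show_raw(name: bytes) -> str:
    """Render every non-ASCII byte as <XX>, like the listing above."""
    return "".join(chr(b) if b < 0x80 else f"<{b:02X}>" for b in name)


local_branch = "fóobar".encode("utf-8")
remote_ref = "origin/exposé".encode("latin-1")

print(show_raw(local_branch))  # f<C3><B3>obar
print(show_raw(remote_ref))    # origin/expos<E9>

# What a UTF-8 command line would pass for the same visible name.
typed = "origin/exposé".encode("utf-8")
print(typed == remote_ref)     # False
```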
When you create a new branch called exposé with Squot, it already ends up encoded correctly (i. e. it shows correctly in the file explorer and the Git command line outputs it in UTF-8).
So I will have a look at the code for interpreting the packed-refs file in FileSystem-Git. It probably does not use UTF-8 encoding when it should.
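For reference, a rough Python sketch of reading packed-refs with UTF-8 decoding of the ref names. This is not the FileSystem-Git code; the function name `parse_packed_refs` and the object id in the sample are made up for illustration, and the sketch assumes the usual layout of one "<sha> <refname>" pair per line plus "#" header and "^" peeled lines.

```python
def parse_packed_refs(data: bytes) -> dict:
    """Map ref names to object ids, decoding ref names as UTF-8."""
    refs = {}
    for line in data.splitlines():
        # Skip blank lines, the header comment, and peeled-tag lines ("^<sha>").
        if not line or line.startswith((b"#", b"^")):
            continue
        sha, _, raw_name = line.partition(b" ")
        # surrogateescape keeps non-UTF-8 bytes recoverable instead of raising.
        name = raw_name.decode("utf-8", errors="surrogateescape")
        refs[name] = sha.decode("ascii")
    return refs


# Placeholder object id, not a real commit.
sample = (
    b"# pack-refs with: peeled fully-peeled sorted\n"
    b"1111111111111111111111111111111111111111 refs/remotes/origin/expos\xc3\xa9\n"
)
refs = parse_packed_refs(sample)
print(list(refs))  # ['refs/remotes/origin/exposé']
```

Using surrogateescape is one possible design choice: a Latin-1-encoded ref such as the second exposé branch would still be read, and encoding the name back with the same error handler yields the original bytes.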
Thank you for your efforts, and sorry again for the delay ... I need to keep a better overview of my inbox. :-)
How does the ref filename look in the repository created by VS Code?
exposé
Please show me your git branch output for your repositories with exposé.
$ git branch
* exposé
main
mas
personae
Also, please create another branch with a special character using the Git CLI and post how that file name ended up encoded under Windows and from WSL.
Just like I type it into the shell, i.e. the è is preserved and displayed correctly under both Windows and WSL and also in VS Code.
Finally, please try to create two branches that are actually the same, but with different encodings (like my two exposé branches above) and report how they show up in git branch.
Both with the original encoding I have used for their names. Looks like there is no encoding problem outside of Squot on my end. :-)
Windows Git might also use GetCommandLineW() or check the current codepage... But GitHub does not allow to search the code of forks, and the repository for Windows Git is a fork.
(Side note: github1s & Ctrl + Shift + F. Or use Gitpod. And there is also a new remote code extension for VS Code ... :-))
There were some encoding issues about the stream used to read and write the packed-refs file. Please check whether the problem with your exposé branch persists now.
Tested it myself with a new branch, looks like it is not fixed yet. The branch is not decoded when a repository is cloned and the ref file in the local repository is created with a wrong name.
The ref names were not decoded during fetches (or when reading the push report), and not encoded during pushes, leading to different information in Squeak vs. on GitHub.
Now a repository with a branch with é gets cloned correctly, and if I push a new branch with an è that also shows correctly on GitHub.
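The idea of the fix, converting at the boundary in both directions, can be sketched in Python as follows. This is not the Squot code; the function names and the "très" branch are hypothetical, and the sketch assumes the remote side expects UTF-8.

```python
def encode_ref_name(name: str) -> bytes:
    """Convert an in-image ref name to the bytes sent on push."""
    return name.encode("utf-8")


def decode_ref_name(raw: bytes) -> str:
    """Convert ref name bytes received on fetch; raises on invalid UTF-8."""
    return raw.decode("utf-8")


# Names survive a push followed by a fetch unchanged.
for name in ("refs/heads/exposé", "refs/heads/très"):
    assert decode_ref_name(encode_ref_name(name)) == name

print(encode_ref_name("è").hex())  # c3a8

# A lone Latin-1 byte for è is rejected instead of being passed through.
try:
    decode_ref_name(b"refs/heads/tr\xe8s")
except UnicodeDecodeError:
    print("invalid UTF-8 ref name")
```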
Please kindly check once more @LinqLover whether that resolves the problem for you as well.
[...] since Git does not care about the encoding of the ref filename, you could also have branch names in other encodings.
I don't think so. We must ensure UTF8 when writing or reading branch names. There is a student project that now has an inaccessible branch because Squot wrote the invalid E4 byte (Codepoint for ä, Latin-1/Unicode) even though C3 A4 was expected.
Both github.com and my local git CLI expect UTF-8. Not sure how you would switch, for example, to Latin-1 (single-byte) encoding just for the branches API...
'ä' encodeForHTTP. " '%C3%A4' "
'ä' squeakToUtf8 collect: #hex as: Array. " #('C3' 'A4') "
#[16rE4] asString utf8ToSqueak. "Error"
#[16rC3 16rA4] asString utf8ToSqueak. " 'ä' "
Just to be clear, the student likely used an older version of Squot (the second latest release).
https://github.com/hpi-swa/Squot/search?q=squeakToUtf8
Well, I would expect some call to squeakToUtf8 as soon as any Squeak String is used as data. Hmmm...
Ah, here is an update from Dec 2021. Might have fixed it: https://github.com/hpi-swa/Squot/search?q=Utf8TextConverter
The actual branch name as shown in the web interface is exposé.
Disclaimer: The project was cloned ~10 days ago. If you have made any encoding-relevant changes in the meantime, I can try to reproduce the issue with the newest develop version. :-)