ko4life-net / ko

Open source development of the game Knight Online. This is a reversed engineered old version of the game aiming to replicate the nostalgic experience we all once had <3
MIT License
52 stars 21 forks source link

.gitattributes working-tree-encoding bugs with UTF-16 encoding #243

Closed stevewgr closed 1 month ago

stevewgr commented 1 month ago

Description

As noted in my comment here: https://github.com/ko4life-net/ko/pull/241#issuecomment-2241447558

  "C:\Program Files\Git\bin\git.exe" checkout --progress --force -B gitattributes-utf16-fix refs/remotes/origin/gitattributes-utf16-fix
  Error: error: failed to encode 'src/game/res/Resource.rc' from UTF-8 to UTF-16
  Error: error: failed to encode 'src/server/AIServer/res/AIServer.rc' from UTF-8 to UTF-16
  Error: error: failed to encode 'src/server/AIServer/res/AIServer.rc2' from UTF-8 to UTF-16
  Error: error: failed to encode 'src/server/Aujard/Aujard.rc' from UTF-8 to UTF-16
  Error: error: failed to encode 'src/server/Aujard/res/Aujard.rc2' from UTF-8 to UTF-16
  Error: error: failed to encode 'src/server/Ebenezer/Ebenezer.rc' from UTF-8 to UTF-16
....

it appears that the .gitattributes configuration for working-tree-encoding is not functioning as expected. This issue is being raised for those interested in addressing it, as well as for those who have previously reported similar problems, including:

and also because git's bug reporting sucks (in hope that one of them see's this), reminiscent of outdated practices.

You might wonder why Git attempts to convert from UTF-8 to UTF-16 instead of the other way around, especially when the files are confirmed to be UTF-16LE with BOM. Well, this log message from git sucks, really... digging further into their source code: https://github.com/git/git/commit/107642fe2661236f48b912480e090799e339c512

It is what they call round-trip encoding, as detailed in their documentation on working-tree-encoding and the source code commit. This separated into two parts:

Now for git to render the diffs they need to be stored in the git index as UTF-8, otherwise they're considered as binary and therefore diffing won't work. .gitattributes is here for the rescue! oh wait, not actually... When we configure .gitattributes file with *.rc text working-tree-encoding=UTF-16LE-BOM, this helps git to know the original encoding of the file, so that internally it can use libiconv to properly transcode from the original encoding to UTF-8 and store it internally in the git index. But when we checkout the file via git, it will take the UTF-8 encoded from the git index and transcode it into the original UTF-16LE-BOM encoding to the git worktree. This explains the log message above in which stage it failed to transcode (encode_to_worktree).

Git's implementation of working-tree-encoding seems mostly correct, but issues arise during the checkout stage. Specifically, the tracked file becomes corrupted during the round-trip transcoding process. See my proof of concept (POC) below for details.

Why is this an issue for us? Because if someone submits a pull request with changes to any *.rc files, we won’t be able to view the diffs directly. Instead, we would need to check the file locally using tools that can render the diff properly. Also we're stricted to use UTF-16 for *.rc files since that's the official encoding Visual Studio uses in the Resource Editor to properly work with them.

Proposal

To Reproduce

Use the following commands to reproduce this issue in Git:

# git version 2.45.2.windows.1
git --version

# open Git Bash
mkdir /tmp/git-bug-poc && cd /tmp/git-bug-poc
git init

echo 'hello' | iconv -f UTF-8 -t UTF-16LE > text.utf16
git add text.utf16 && git commit -m initial

# append changes to text.utf16
echo 'world' | iconv -f UTF-8 -t UTF-16LE >> text.utf16
git commit -m diff1 -a

# create the .gitattributes
echo '*.utf16 text working-tree-encoding=UTF-16LE-BOM eol=LF' > .gitattributes
git add .gitattributes && git commit -m gitattributes

# append changes to text.utf16
echo 'broken' | iconv -f UTF-8 -t UTF-16LE >> text.utf16
git commit -m diff2 -a

# reproduce the bug
# note that git bash will silently corrupt the file, whilst on powershell it will actually report the error:
# error: failed to encode 'text.utf16' from UTF-8 to UTF-16LE-BOM
# and leave you with a broken worktree where you can't discard the changes unless you git reset --hard to the commit
# before adding the .gitattributes file.
git checkout HEAD~1

# fix the corruption
git reset --hard HEAD~1

Tasks

stevewgr commented 1 month ago

pre-commit script done:

#!/bin/bash

echo 'Running pre-commit hook...'

git reset '*-diff.rc' '*-diff.rc2'
git checkout -- '*-diff.rc' '*-diff.rc2'
git clean -df '*-diff.rc' '*-diff.rc2'

rc_files=$(find . -type f \( -name '*.rc' -o -name '*.rc2' \) ! -name '*-diff.rc' ! -name '*-diff.rc2')
for rc_file in $rc_files; do
  extension="${rc_file##*.}"
  diff_file="${rc_file%.*}-diff.${extension}"
  echo '// ** DO NOT TOUCH THIS FILE! AUTO GENERATED FOR DIFFING PURPOSES. **' > "$diff_file"
  iconv -f UTF-16LE -t UTF-8 "$rc_file" | awk 'NR==1{sub(/^\xEF\xBB\xBF/,"")}1' >> "$diff_file"
done

git add '*-diff.rc' '*-diff.rc2'

exit 0

I'll continue working tomorrow on integrating it into the config script.