Open syrel opened 3 years ago
Update: After a bit of digging, we found that git_commit_message
does not guarantee the returned message to be encoded in UTF-8. In fact:
Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.). (https://git-scm.com/docs/git-commit/2.13.7#_discussion)
And this is what we see, the encoding of the previously mentioned commit is in fact ISO-8859-1
. However, the LGitLibrary
bindings indirectly assume it is always UTF8:
LGitLibrary>>#commit_message: commit
^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()
The return type is declared as String
which means it is converted from the ExternalAddress
via: ExternalData(ByteArray)>>readStringUTF8
.
Any ideas on how to fix it?
Hi. I think it's the same issue reported as https://github.com/pharo-vcs/iceberg/issues/1426
I have a fix ready that essentially does this:
When reading the message, try to use the encoding provided in the Git commit header. If that header does not provide the correct encoding (i.e. we get an encoding error), just return the string as-is, since it’s the best we can do.
I’d be happy to contribute this back into Iceberg. Is this approach what we envisioned? If so, what target branch should I use?
I pushed the fix so it can be inspected for whether it’s desirable in its current state or not.
Thanks a lot we will have a look.
Hi,
In one of the respositories there is a very old commit (from 2018) with a non-breaking space in the commit message that breaks Iceberg: https://github.com/feenkcom/gtoolkit-phlow/tree/3bf1a9728c47a834e105943d67a2e7dd0f01fb10
This leads to inability to pull new commits because non-breaking space is for some reason considered as an illegal utf8 sequence. Why would Iceberg load such old commits when they are 700 commits behind the HEAD?
Here are the bytes that
git_commit_message
returns: