Can not pull if a very old commit message contains Non-breaking space

pharo-vcs / iceberg

Iceberg is the main toolset for handling VCS in Pharo.

MIT License

134 stars 85 forks source link

Can not pull if a very old commit message contains Non-breaking space #1520

Open syrel opened 3 years ago

syrel commented 3 years ago

Hi,

In one of the respositories there is a very old commit (from 2018) with a non-breaking space in the commit message that breaks Iceberg: https://github.com/feenkcom/gtoolkit-phlow/tree/3bf1a9728c47a834e105943d67a2e7dd0f01fb10

This leads to inability to pull new commits because non-breaking space is for some reason considered as an illegal utf8 sequence. Why would Iceberg load such old commits when they are 700 commits behind the HEAD?

Here are the bytes that git_commit_message returns:

#(91 116 97 115 107 105 116 93 160 104 97 110 100 108 101 32 101 120 97 99 116 44 32 115 111 109 101 44 32 97 110 100 32 97 110 121 32 112 114 111 103 114 101 115 115 32 110 111 116 105 102 105 99 97 116 105 111 110) asByteArray utf8Decoded

syrel commented 3 years ago

Update: After a bit of digging, we found that git_commit_message does not guarantee the returned message to be encoded in UTF-8. In fact:

Commit log messages are typically encoded in UTF-8, but other extended ASCII encodings are also supported. This includes ISO-8859-x, CP125x and many others, but not UTF-16/32, EBCDIC and CJK multi-byte encodings (GBK, Shift-JIS, Big5, EUC-x, CP9xx etc.). (https://git-scm.com/docs/git-commit/2.13.7#_discussion)

And this is what we see, the encoding of the previously mentioned commit is in fact ISO-8859-1. However, the LGitLibrary bindings indirectly assume it is always UTF8:

LGitLibrary>>#commit_message: commit
    ^ self ffiCallSafely: #(String git_commit_message #(self)) options: #()

The return type is declared as String which means it is converted from the ExternalAddress via: ExternalData(ByteArray)>>readStringUTF8.

Any ideas on how to fix it?

tinchodias commented 3 years ago

Hi. I think it's the same issue reported as https://github.com/pharo-vcs/iceberg/issues/1426

hellerve commented 2 years ago

I have a fix ready that essentially does this:

When reading the message, try to use the encoding provided in the Git commit header. If that header does not provide the correct encoding (i.e. we get an encoding error), just return the string as-is, since it’s the best we can do.

I’d be happy to contribute this back into Iceberg. Is this approach what we envisioned? If so, what target branch should I use?

hellerve commented 2 years ago

I pushed the fix so it can be inspected for whether it’s desirable in its current state or not.

Ducasse commented 2 years ago

Thanks a lot we will have a look.