Detecting Malicious Unicode in Source Code and Pull Requests

I couldn't find a proper place to report this. Feel free to move it if this is not the right place.

Quote [https://trojansource.codes/ Trojan Source: Invisible Vulnerabilities]

Invisible Source Code Vulnerabilities

Some Vulnerabilities are Invisible

Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.

These adversarial encodings produce no visual artifacts.

The trick

The trick is to use Unicode control characters to reorder tokens in source code at the encoding level. These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens. Compilers and interpreters adhere to the logical ordering of source code, not the visual order.

The attack

The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic. ... Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers. This attack pattern is tracked as CVE-2021-42574.

CVE-2021-42574 at redhat
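Since the attack relies on control characters embedded in comments and strings, a reviewer could look at exactly those tokens. Below is a minimal sketch for Python source, using the standard `tokenize` module. The function name `bidi_in_comments_and_strings` and the `BIDI_CONTROLS` set are my own illustrative choices, not something taken from the paper's tooling, and the sketch assumes the input tokenizes cleanly.

```python
import io
import tokenize

# Bidirectional control characters (LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI),
# written as escapes so that this snippet itself stays pure ASCII.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def bidi_in_comments_and_strings(source):
    """Return (line_number, token_kind) for each comment or plain string
    token in Python source text that contains a bidi control character."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.STRING):
            if any(ch in BIDI_CONTROLS for ch in tok.string):
                hits.append((tok.start[0], tokenize.tok_name[tok.type]))
    return hits
```

This only covers Python files and only comments and plain string literals. Other languages would need their own tokenizer, or a cruder whole-file scan.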

The supply chain

This attack is particularly powerful within the context of software supply chains. If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.

The technique

There are multiple techniques that can be used to exploit the visual reordering of source code tokens:

Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.

Commenting-Out causes a comment to visually appear as code, which in turn is not executed.

Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.

The variant

A similar attack exists which uses homoglyphs, or characters that appear near identical.

...

The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference. An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code. This attack variant is tracked as CVE-2021-42694.
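For the homoglyph variant, a rough check is to flag identifiers that mix scripts. Python's `unicodedata` module has no script property, so the sketch below takes the first word of each character's Unicode name (for example `LATIN` or `CYRILLIC`) as a stand-in for its script. That is a heuristic I am assuming is good enough for a first pass, not a proper confusables check. The function names and the sample identifier in the usage note are hypothetical.

```python
import unicodedata


def scripts_in_identifier(identifier):
    """Return the set of script-like prefixes (first word of the Unicode
    character name) for the letters in an identifier.
    Heuristic only: this approximates the Unicode Script property."""
    scripts = set()
    for ch in identifier:
        if not ch.isalpha():
            continue  # skip digits and underscores
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier):
    """True if the letters of the identifier come from more than one script."""
    return len(scripts_in_identifier(identifier)) > 1
```

For example, `is_mixed_script("say\u041dello")` is true because U+041D is a Cyrillic capital letter that looks like a Latin H, while `is_mixed_script("sayHello")` is false.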

The defense

Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
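What "unterminated" could mean in practice can be sketched as a per-line balance check: embeddings and overrides are closed by PDF (U+202C), isolates by PDI (U+2069), and anything still open at the end of the line is reported. This is a simplified reading of the Unicode bidirectional algorithm and my own assumption about a reasonable check, not how any particular compiler implements its warning. The name `has_unterminated_bidi` is hypothetical.

```python
EMBED_OPEN = set("\u202a\u202b\u202d\u202e")   # LRE, RLE, LRO, RLO
ISOLATE_OPEN = set("\u2066\u2067\u2068")        # LRI, RLI, FSI
PDF = "\u202c"                                  # closes an embedding or override
PDI = "\u2069"                                  # closes an isolate


def has_unterminated_bidi(line):
    """Return True if the line ends with a bidi embedding, override,
    or isolate still open (simplified model, one line at a time)."""
    stack = []
    for ch in line:
        if ch in EMBED_OPEN:
            stack.append("E")
        elif ch in ISOLATE_OPEN:
            stack.append("I")
        elif ch == PDF:
            # PDF only closes an embedding/override at the top of the stack.
            if stack and stack[-1] == "E":
                stack.pop()
        elif ch == PDI:
            # PDI closes the innermost isolate and anything opened inside it.
            if "I" in stack:
                while stack.pop() != "I":
                    pass
    return bool(stack)
```

A line containing only a right-to-left override is flagged, while a line where every opener has its matching closer passes.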

Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.

Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
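Making the control characters perceptible can be as simple as replacing each one with a visible marker before showing the text to a reviewer. A small sketch follows. The `<U+XXXX>` marker format and the name `reveal_bidi` are my own choices for illustration.

```python
# Bidirectional control characters, written as escapes so the snippet stays ASCII.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def reveal_bidi(text):
    """Replace each bidi control character with a visible <U+XXXX> marker
    and leave all other characters untouched."""
    return "".join(
        f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch
        for ch in text
    )
```

With this, `reveal_bidi("a\u202eb")` gives `a<U+202E>b`, so the override shows up in a diff or terminal instead of silently reordering the display.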

The paper

Complete details can be found in the related [https://trojansource.codes/trojan-source.pdf paper].

By authors Nicholas Boucher and Ross Anderson, 2021, [https://arxiv.org/abs/2111.00169 arXiv].

tasks:

[ ] check for potential existing compromises: scan all distribution source code for existing Unicode
[ ] educate existing and future distribution source code reviewers: add a distribution source code reviewer policy to a GitHub repository or to the distribution website, which existing and future reviewers need to acknowledge ("I understand the issue"). More of a reminder, a conversation starter.
[ ] remove as much Unicode from distribution source code as possible: reducing the amount of Unicode in distribution source code makes audits for malicious Unicode with automated tools simpler. Where a Unicode character is considered essential, it should if possible be written in an encoded ASCII form rather than as the literal character, for example &reg; instead of ®.
[ ] local check by reviewer: document tools that distribution source code reviewers could/should use to scan future contributions for malicious Unicode
[ ] remote cursory check: add a GitHub pull request hook that notifies when Unicode is included in a pull request. (This is just an additional, handy layer of protection. Since infrastructure should be distrusted, this alone is not a full solution.)
[ ] build scripts / CI scripts: these should check whether there is Unicode in any files other than opt-in expected files. If there is unexpected Unicode, the build should error out.
[ ] scan upstream projects' source code: check whether these are compromised by malicious Unicode
[ ] notify upstream projects: these might not be aware of this issue and might already be compromised by malicious Unicode.
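The scanning and build-check tasks could start from something like the sketch below. It reads "Unicode" in the tasks as "any non-ASCII character", which is my assumption, and treats the allow-list of opt-in files as paths relative to the scanned directory. The names `scan_tree` and `main` are hypothetical, and binary files would need to be allow-listed or skipped separately.

```python
import os


def non_ascii_lines(path):
    """Yield the 1-based number of every line in a file that contains a
    non-ASCII character. Undecodable bytes are replaced with U+FFFD and
    are therefore reported as well."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.isascii():
                yield number


def scan_tree(root, allowed=()):
    """Return (relative_path, line_number) for non-ASCII lines under root,
    skipping .git and any relative paths listed in `allowed`."""
    allowed = {os.path.normpath(p) for p in allowed}
    findings = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = sorted(d for d in dirnames if d != ".git")
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.normpath(os.path.relpath(path, root))
            if rel in allowed:
                continue
            findings.extend((rel, number) for number in non_ascii_lines(path))
    return findings


def main(argv):
    """argv[0] is the directory to scan, the rest are allow-listed files.
    Returns 1 if unexpected non-ASCII text was found, otherwise 0."""
    root = argv[0] if argv else "."
    findings = scan_tree(root, allowed=argv[1:])
    for rel, number in findings:
        print(f"{rel}:{number}: unexpected non-ASCII character")
    return 1 if findings else 0
```

A build or CI script could call `sys.exit(main(sys.argv[1:]))` so that unexpected Unicode makes the build error out. The same `scan_tree` function could be pointed at a checkout of an upstream project.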

references:

linuxmint / cinnamon: Detecting Malicious Unicode in Source Code and Pull Requests #10981