linuxmint / cinnamon

A Linux desktop featuring a traditional layout, built from modern technology and introducing brand new innovative features.
GNU General Public License v2.0

Detecting Malicious Unicode in Source Code and Pull Requests #10981

Open TNTBOMBOM opened 2 years ago

TNTBOMBOM commented 2 years ago

I couldn't find a proper place to report this; feel free to move it if this is not the right place.

Quoted from [https://trojansource.codes/ Trojan Source: Invisible Vulnerabilities]:

Invisible Source Code Vulnerabilities

Some Vulnerabilities are Invisible

Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.

These adversarial encodings produce no visual artifacts.

The trick

The trick is to use Unicode control characters to reorder tokens in source code at the encoding level. These visually reordered tokens can be used to display logic that, while semantically correct, diverges from the logic presented by the logical ordering of source code tokens. Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
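To make the gap between logical and visual order concrete, here is a minimal Python sketch. The `snippet` string and the `logical_view` helper are hypothetical names made up for illustration; the control characters are written as escapes so they stay visible here, whereas a real attack would embed them raw.

```python
import unicodedata

# Logical order: this is the exact code point sequence an interpreter reads.
# A bidi-aware editor may display the same line in a different visual order.
snippet = "access = 'user\u202e \u2066# admin only\u2069 \u2066'"


def logical_view(text):
    """Replace every format (Cf) character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


print(logical_view(snippet))
```

Printing the tagged form shows where the controls sit in the logical order, which is the order the compiler or interpreter follows.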

The attack

The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes its logic. ... Adversaries can leverage this deception to commit vulnerabilities into code that will not be seen by human reviewers. This attack pattern is tracked as CVE-2021-42574.

CVE-2021-42574 at Red Hat
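A reviewer or CI job could look for these characters directly. The following is a rough sketch, not an official tool: `find_bidi_controls` is a hypothetical helper, and the set of nine characters is assumed to be the embedding, override and isolate controls plus their two terminators.

```python
import unicodedata

# Assumed set: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset(
    "\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits


sample = "ok = True\nif ok: # \u202e done"
for hit in find_bidi_controls(sample):
    print(hit)
```

Running it over the files touched by a pull request would list every occurrence for manual review. Legitimate right-to-left text may also contain these characters, so a hit is a prompt to look closer, not proof of an attack.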

The supply chain

This attack is particularly powerful within the context of software supply chains. If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.

The technique

There are multiple techniques that can be used to exploit the visual reordering of source code tokens:

  • Early Returns cause a function to short circuit by executing a return statement that visually appears to be within a comment.
  • Commenting-Out causes a comment to visually appear as code, which in turn is not executed.
  • Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
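The stretched-string case is the easiest to reproduce safely. In this hedged sketch the literal and the `is_plain_user` function are invented for illustration, and the hidden characters are written as escapes so they can be seen; in an actual attack they would be raw and invisible in the file.

```python
# The literal looks like 'user' in a bidi-aware viewer but is three code
# points longer: RLO, a space, and LRI are part of the string itself.
STRETCHED_LITERAL = "user\u202e \u2066"


def is_plain_user(access_level):
    """Compare against the stretched literal, as the attacked code would."""
    return access_level == STRETCHED_LITERAL


print(is_plain_user("user"))    # False: the comparison silently fails
print(len(STRETCHED_LITERAL))   # 7, not 4
```

The comparison fails even though the code appears to compare against the plain word, which is the effect the third bullet describes.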

The variant

A similar attack exists which uses homoglyphs, or characters that appear near identical.

...

The above example defines two distinct functions with near indistinguishable visual differences highlighted for reference. An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code. This attack variant is tracked as CVE-2021-42694.
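Since the example is elided above, here is a separate, hypothetical illustration in Python. It uses the first word of each character's Unicode name as a crude stand-in for its script; that is an assumption made for brevity, not the confusable-detection procedure from the Unicode standard. The identifiers and the `scripts_used` and `is_mixed_script` helpers are invented for this sketch.

```python
import unicodedata


def scripts_used(identifier):
    """Approximate the scripts in an identifier from Unicode character names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def is_mixed_script(identifier):
    """Flag identifiers whose letters come from more than one script."""
    return len(scripts_used(identifier)) > 1


latin_name = "sayHello"
lookalike = "say\u041dello"  # U+041D is CYRILLIC CAPITAL LETTER EN

print(latin_name == lookalike)     # False: two distinct identifiers
print(is_mixed_script(lookalike))  # True
```

A check like this would flag the lookalike name while leaving identifiers written entirely in a single script alone.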

The defense

  • Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
  • Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.
  • Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
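The "unterminated" condition in the first two bullets can be approximated with a per-line balance check. This is a simplified sketch under stated assumptions: it counts openers and terminators on each line without parsing comments or string literals, and it does not model every rule of the Unicode bidirectional algorithm. `unterminated_bidi_lines` is a hypothetical name.

```python
EMBEDDING_OPENERS = frozenset("\u202a\u202b\u202d\u202e")  # LRE, RLE, LRO, RLO
ISOLATE_OPENERS = frozenset("\u2066\u2067\u2068")          # LRI, RLI, FSI
PDF = "\u202c"  # terminates an embedding or override
PDI = "\u2069"  # terminates an isolate


def unterminated_bidi_lines(source):
    """Return line numbers that still have an open bidi control at line end."""
    flagged = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        open_embeddings = 0
        open_isolates = 0
        for ch in line:
            if ch in EMBEDDING_OPENERS:
                open_embeddings += 1
            elif ch in ISOLATE_OPENERS:
                open_isolates += 1
            elif ch == PDF and open_embeddings:
                open_embeddings -= 1
            elif ch == PDI and open_isolates:
                open_isolates -= 1
        if open_embeddings or open_isolates:
            flagged.append(line_no)
    return flagged


sample = "x = 1\ny = 'a\u202eb'\nz = 'a\u2066b\u2069'\n"
print(unterminated_bidi_lines(sample))  # [2]
```

A build pipeline could fail or warn when the returned list is non-empty. Lines with properly closed controls, such as the third line of `sample`, pass.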

The paper

Complete details can be found in the related [https://trojansource.codes/trojan-source.pdf paper].

By authors Nicholas Boucher and Ross Anderson, 2021, [https://arxiv.org/abs/2111.00169 arXiv].

ItzSwirlz commented 2 years ago

This should be a part of the scan-build project w/ Clang.