microsoft / sarif-sdk

.NET code and supporting files for working with the 'Static Analysis Results Interchange Format' (SARIF, see https://github.com/oasis-tcs/sarif-spec)
Other
193 stars 91 forks source link

Fixing textual classifier's handling of misaligned data #2780

Closed scottoneil-ms closed 7 months ago

scottoneil-ms commented 7 months ago

The classifier attempts to decode an input buffer as both an Uft32 and Utf16. It takes a count of "characters" to read. In practice, this seems to be bytes to read, and callers tend to pass in the buffer length. This works fine when the byte[] is sized to an array of length equivalent to zero modulo four (the number of bytes in a UTF32 buffer.) But when the alignments are less favorable, it is possible for one encoding to return a false-binary detection, and the other to partially read off the end of the buffer in a way that fails the decoding.

Proposed fix is to have the classifier ensure the decoding is aligned to zero modulo 4.