Fixing textual classifier's handling of misaligned data

The classifier attempts to decode an input buffer as both UTF-32 and UTF-16. It takes a count of "characters" to read. In practice, this appears to be a count of bytes, and callers tend to pass in the buffer length. This works fine when the length of the byte[] is a multiple of four (the size in bytes of a UTF-32 code unit). But when the length is not a multiple of four, it is possible for one encoding to return a false-binary detection, and for the other to partially read off the end of the buffer in a way that fails the decoding.

The proposed fix is to have the classifier ensure that the number of bytes it decodes is a multiple of four, so that decoding never ends partway through a code unit.
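
As a rough illustration of the idea (not the actual change in this PR), the sketch below rounds the byte count down to a multiple of four before decoding. It is written in Python for brevity; `aligned_length` and `decodes_cleanly` are hypothetical names and do not correspond to anything in the sarif-sdk code.

```python
def aligned_length(byte_count: int, unit_size: int = 4) -> int:
    """Round byte_count down to the nearest multiple of unit_size.

    A multiple of 4 is also a multiple of 2, so the same length is
    whole-code-unit aligned for both UTF-32 and UTF-16.
    """
    return byte_count - (byte_count % unit_size)


def decodes_cleanly(buffer: bytes, encoding: str, byte_count: int) -> bool:
    """Return True if the first byte_count bytes decode without error."""
    try:
        buffer[:byte_count].decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Five UTF-32 code units (20 bytes) followed by two stray bytes.
sample = "hello".encode("utf-32-le")
misaligned = sample + b"\x41\x00"

# Passing the raw buffer length ends the read partway through a code unit.
raw_ok = decodes_cleanly(misaligned, "utf-32-le", len(misaligned))

# Trimming to a multiple of four drops the partial code unit first.
trimmed_ok = decodes_cleanly(
    misaligned, "utf-32-le", aligned_length(len(misaligned))
)
```

With the raw length the UTF-32 decode fails on the truncated trailing code unit, while the trimmed length decodes. The cost of this approach is that up to three trailing bytes are ignored by the classifier.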

microsoft / sarif-sdk

Fixing textual classifier's handling of misaligned data #2780