microsoft / sarif-sdk

.NET code and supporting files for working with the 'Static Analysis Results Interchange Format' (SARIF, see https://github.com/oasis-tcs/sarif-spec)

Replace file encodings #2741

Closed michaelcfanning closed 8 months ago

michaelcfanning commented 8 months ago

Code coverage: 100% (which really makes the point of how critical data coverage is, as well as code coverage). :)

This change pivots our textual data detection mechanism. We now attempt to decode the data as UTF-8, Windows-1252 and UTF-32. If any of these attempts produces a Unicode replacement character (U+FFFD) or encounters an embedded NUL (after the first character, where a textual BOM might generate one), we classify the data as binary.

Otherwise it's text.
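
For readers who want to experiment with the idea, here is a minimal Python sketch of the check described above. It is not the sarif-sdk implementation (which is .NET), and the names `decodes_cleanly` and `looks_like_text` are hypothetical. It also makes one assumption the description leaves open: data counts as text when at least one of the three candidate decodings is clean, and as binary only when all of them hit a replacement character or an embedded NUL.

```python
# Illustrative sketch only; not the sarif-sdk code.
REPLACEMENT_CHAR = "\ufffd"

# Python codec names for UTF-8, Windows-1252 and UTF-32.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "utf-32")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes with no U+FFFD and no embedded NUL."""
    # errors="replace" substitutes U+FFFD for undecodable input
    # instead of raising, so a failed decode shows up in the text.
    text = data.decode(encoding, errors="replace")
    if REPLACEMENT_CHAR in text:
        return False
    # Skip the first character, where a BOM might generate a NUL.
    return "\x00" not in text[1:]


def looks_like_text(data: bytes) -> bool:
    """Assumed combining rule: text if any candidate decoding is clean."""
    return any(decodes_cleanly(data, enc) for enc in CANDIDATE_ENCODINGS)


if __name__ == "__main__":
    samples = {
        "ascii": b"hello world",
        "windows-1252": "caf\u00e9".encode("cp1252"),
        "utf-32 with BOM": "hi".encode("utf-32"),
        "png header": b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR",
    }
    for name, payload in samples.items():
        kind = "text" if looks_like_text(payload) else "binary"
        print(f"{name}: {kind}")
```

Note that Python's `cp1252` codec treats the five unassigned Windows-1252 bytes as errors, which may differ from how a .NET decoder handles them, so results on edge cases need not match the real utility.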

This utility is currently running against a 3M-file test data set, and it looks generally effective so far. I am concerned about performance, though; we need to look at this closely.