NEW: The `FileEncoding.IsTextualData` utility can effectively distinguish between binary and textual data.
Code coverage: 100% (which really makes the point that data coverage is just as critical as code coverage. :)
This change pivots our textual data detection mechanism. We now attempt to decode the data as UTF-8, Windows-1252 and UTF-32. If any of these attempts produces a Unicode replacement character (U+FFFD) or observes an embedded `NUL` (after the first character, where a textual BOM might generate one), we classify the data as binary.
Otherwise it's text.
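
For illustration only, here is a minimal Python sketch of the decode-and-inspect idea described above. It is not the `FileEncoding.IsTextualData` implementation. The names `decodes_cleanly` and `is_textual_data` are hypothetical, and `cp1252` is Python's codec name for Windows-1252. The note does not spell out how the three decode results are combined, so the sketch assumes the data counts as text as soon as one candidate decoding is clean. Requiring all three to be clean would reject plain ASCII, which does not decode as valid UTF-32.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character
# Candidate encodings from the note; "cp1252" is Python's name for Windows-1252.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "utf-32")


def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """True if decoding yields no U+FFFD and no NUL after the first character."""
    text = data.decode(encoding, errors="replace")
    if REPLACEMENT in text:
        return False
    # Index 0 is skipped, following the note's allowance for BOM residue.
    return "\x00" not in text[1:]


def is_textual_data(data: bytes) -> bool:
    """Assumption: data is text if at least one candidate decodes cleanly."""
    return any(decodes_cleanly(data, enc) for enc in CANDIDATE_ENCODINGS)


if __name__ == "__main__":
    print(is_textual_data(b"hello, world\n"))                          # plain ASCII
    print(is_textual_data(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))      # PNG header bytes
```

Codec behavior differs between runtimes (for example, in how undefined Windows-1252 bytes are handled), so results from this sketch may not match the utility on every input.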
This utility is currently running against a 3M file test data set and looks generally effective so far. I am concerned about performance; we need to look at this closely.