Add binary vs. textual file enumeration.

michaelcfanning commented 8 months ago

BRK: EnumeratedArtifact now sniffs artifacts to distinguish between textual and binary data. The Contents property will be null for binary files (use Bytes instead).
BRK: EnumeratedArtifact raises InvalidOperationException for non-seekable streams (unless the SupportNonSeekableStreams property is set). Non-seekable streams result in performance inefficiencies when expanding.
BRK: MultithreadedZipArchiveArtifactProvider now distinguishes binary vs. textual data using a hard-coded binary files extensions list. This data will be made configurable in a future change.
NEW: EnumeratedArtifact now automatically detects and populates a Bytes property for binary files such as executables and certificates.

Previously, artifact enumeration assumed all data consisted of text. This breaks attempts to load/examine binary data.

In EnumeratedArtifact, we now examine the beginning of all seekable streams to detect whether we are seeing textual or binary data. We do so using an OSS snippet that looks for null/control characters in that data. This code also attempts to retrieve a range of text file encodings but this capability is not verified. Binary files are retrieved as byte arrays. Text files go to the existing Contents property.

If we have a non-seekable stream, what we have to do is retrieve the file as a byte array and then subsequently convert it to text if we determine that it is textual. This is a performance degradation from our previous implementation and so we require users to explicitly set a SupportNonSeekableStreams property to ensure they know what they are opting into.

Separately, the MultithreadedZipArchiveArtifactProvider does enumerate non-seekable streams, since they comprise compressed data. This provider uses a specialized ZipArchiveArtifact kind. Rather than accepting the costs to double-transform a text stream to bytes and then a decoded string, we simply hard-code a set of binary file extensions and only retrieve these as bytes. This property is settable. This isn't a really ideal solution but works for our immediate scenarios. You could imagine updating this class to optionally allow for the non-performant behavior (expand as bytes, sniff to classify as binary or text, convert text to strings). This would allow for higher fidelity classification without requiring advance knowledge of file extensions.

Code coverage information:

94.38% : EnumeratedArtifact 97.22% : MultithreadedZipArtifactProvider 93.59% : ZipArchiveArtifact

scottoneil-ms commented 8 months ago

MemoryStream is backed by a byte[]. Presumably, they grow it exponentially. But the constructor takes a size argument so you can presize it correctly in the case that we can read Length. Because it's just a byte[] under the covers, MemoryStream is about the same cost as a byte[]. You can get the raw underlying buffer via GetBuffer(). But if it wasn't sized precisely, it will tend to be bigger than the total data. So in the case where we can read .Length the cost is about the same.

You're probably hating the bad abstraction -which... fair.

If you're sure we can always read .Length then it doesn't get us much. Once we have a pre-sized byte[] in memory, we can just wrap that in a MemoryStream if we ever want to treat it as a Stream. (ie: to do an async stream copy, decode it as text --whatever.)

From: Michael C. Fanning @.> Sent: Thursday, November 30, 2023 5:10 PM To: microsoft/sarif-sdk @.> Cc: Scott O'Neil (ALLEGIS GROUP HOLDINGS INC) @.>; Comment @.> Subject: Re: [microsoft/sarif-sdk] Add binary vs. textual file enumeration. (PR #2739)

@michaelcfanning commented on this pull request.

In src/Sarif/ZipArchiveArtifact.cshttps://github.com/microsoft/sarif-sdk/pull/2739#discussion_r1411486198:

get => GetArtifactData().bytes;

set => this.bytes = value;
}
private (string text, byte[] bytes) GetArtifactData()
{
if (this.contents == null && this.bytes == null)
{
lock (this.archive)
{
if (this.contents == null && this.bytes == null)
{
string extension = Path.GetExtension(Uri.ToString());
if (this.binaryExtensions.Contains(extension))
{
this.bytes = new byte[Stream.Length];

ZipArchive supports providing a decompressed length. What's the performance impact of creating a memory stream as you describe? Will it effectively use the wrapped stream as its buffer? And so the length computation is computable before the memory stream did it? :)

This sounds like a nice next clean-up. It also makes me wonder if I've been over-concerned about handling non-seekable streams. This case may be entirely unremarkable based on what you've pointed out: we should simply read the original stream into a byte array, sniff it and then, if we need to decode, we wrap it in a memory stream (which will use the byte array as its buffer and not create a new copy).

- Reply to this email directly, view it on GitHubhttps://github.com/microsoft/sarif-sdk/pull/2739#discussion_r1411486198, or unsubscribehttps://github.com/notifications/unsubscribe-auth/BCW4VCNWS56FFVJZRM5YCH3YHEU47AVCNFSM6AAAAABABOE6WKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTONJYHAZTGMJTGY. You are receiving this because you commented.Message ID: @.**@.>>

scottoneil-ms commented 8 months ago

Oh, you're right. I'm just proposing a copy with extra steps. Sorry.

What about disposing the stream and opening a new one to avoid the full copy? Call it "Seek(Origin), cave-man style."

From: Michael C. Fanning @.> Sent: Thursday, November 30, 2023 4:49 PM To: microsoft/sarif-sdk @.> Cc: Scott O'Neil (ALLEGIS GROUP HOLDINGS INC) @.>; Comment @.> Subject: Re: [microsoft/sarif-sdk] Add binary vs. textual file enumeration. (PR #2739)

@michaelcfanning commented on this pull request.

In src/Sarif/EnumeratedArtifact.cshttps://github.com/microsoft/sarif-sdk/pull/2739#discussion_r1411470826:

}

+

Stream = null;
return (this.contents, this.bytes);
}
private void RetrieveDataFromNonSeekableStream()
{
bool isText;
if (!SupportNonSeekableStreams)
{
throw new InvalidOperationException("Stream is not seekable. Provide a seekable stream or set the 'SupportNonSeekableStreams' property.");
}
this.bytes = new byte[Stream.Length];

OK, so first, 4096 is way too much data and I dropped it to 1024. This is probably still more data than we need.

Let's talk offline further about this (in particular about whether we can address the inefficiency in the non-seekable case of reading the whole thing). I'm just not aware of how we can handle a string decoding from the original non-seekable string if we've already obtained 1024 bytes from it.

- Reply to this email directly, view it on GitHubhttps://github.com/microsoft/sarif-sdk/pull/2739#discussion_r1411470826, or unsubscribehttps://github.com/notifications/unsubscribe-auth/BCW4VCNKBOWUGERPBJLI64LYHESODAVCNFSM6AAAAABABOE6WKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTONJYHAYDONBZGM. You are receiving this because you commented.Message ID: @.***>

microsoft / sarif-sdk

Add binary vs. textual file enumeration. #2739