sample: Introduce identity

michaelweiser commented 2 years ago

Previously we used the sha256sum of the sample content as the main distinguishing property for samples in the in-flight locking, the duplicates backlog and the cached result lookup. Depending on the context, this was either too lax or too strict:

For cached results we had already added the file extension, because Cuckoo in particular might select a different analysis package depending on that property.

Other analysers might behave differently and be influenced by further properties. Filetools, for example, can guess the file type from the file name and, in the case of data: URLs, considers more than just the extension.

Also, with expressions, the admin might consider additional properties explicitly within the ruleset without even employing any analyser.

Therefore we add the concept of a sample identity that includes all client-controlled properties that might influence ruleset decisions, such as filename, content-type and content-disposition. This makes the lookup of cached results stricter and causes more analyses to actually be performed instead of conflating the results of "similar" samples. It also makes us less eager to put samples into the duplicate backlog, whether locally or based on the in-flight markers of other cluster instances.
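
To illustrate the idea, here is a minimal sketch of how such an identity could be derived. It is not the code from this PR: the function name `sample_identity`, its parameters and the exact encoding are assumptions made up for the example. It hashes the content digest together with the client-controlled properties, length-prefixing each one so that different property combinations cannot collide by concatenation.

```python
import hashlib
from typing import Optional


def sample_identity(content: bytes,
                    filename: Optional[str] = None,
                    content_type: Optional[str] = None,
                    content_disposition: Optional[str] = None) -> str:
    """Hypothetical helper: derive an identity from content plus
    client-controlled properties (illustrative, not the PR's code)."""
    digest = hashlib.sha256()
    # start from the plain content hash that was used on its own before
    digest.update(hashlib.sha256(content).digest())

    for prop in (filename, content_type, content_disposition):
        if prop is None:
            # marker byte so a missing property differs from an empty one
            digest.update(b"\x00")
        else:
            encoded = prop.encode("utf-8")
            # length prefix keeps ("ab", "c") distinct from ("a", "bc")
            digest.update(b"\x01" + len(encoded).to_bytes(8, "big") + encoded)

    return digest.hexdigest()
```

With this sketch, two submissions of the same bytes under different names, for example `sample_identity(data, filename="a.doc")` and `sample_identity(data, filename="a.exe")`, yield different identities and would therefore neither share a cached result nor block each other as in-flight duplicates.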

Note: We still consider the ruleset and analyser behaviour static, i.e. old results still need to be invalidated whenever analyser or ruleset behaviour changes in a manner that makes old results inconclusive.

michaelweiser commented 2 years ago

Need to rework this for asyncio. Bear with me.

michaelweiser commented 2 years ago

This should be good to go now. Schema increase was done in another PR recently, so I did not increment it again.

scVENUS / PeekabooAV

sample: Introduce identity #218