Previously we used the sha256sum of the sample content as the main
distinguishing property for samples in in-flight locking, the duplicates
backlog and cached result lookup. Depending on the context this was
either too lax or too strict:
For cached results we had already added the file extension because
particularly Cuckoo might select a different analysis package depending
on that property.
Other analysers might behave differently and be influenced by further
properties. Filetools, for example, can guess the file type from the
file name and, in the case of data: URLs, considers more than just the
extension.
Also, with expressions, the admin might consider additional properties
explicitly within the ruleset without even employing any analyser.
Therefore we add the concept of a sample identity that includes all
client-controlled properties that might influence ruleset decisions such
as filename, content-type and content-disposition. This makes the lookup
of cached results stricter and causes more analyses to actually be
performed instead of conflating the results of "similar" samples. It
also makes us less eager to put samples into the duplicate backlog,
whether locally or based on the in-flight markers of other cluster
instances.
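As a rough illustration of the idea only, not the code in this change:
an identity could be derived by hashing the content digest together with
the client-controlled properties. The function name sample_identity, its
parameters and the length-prefixed encoding are assumptions made for
this sketch.

```python
import hashlib
from typing import Optional


def sample_identity(content: bytes,
                    filename: Optional[str] = None,
                    content_type: Optional[str] = None,
                    content_disposition: Optional[str] = None) -> str:
    """Hypothetical sketch: derive one identity string from the sample
    content plus the client-controlled properties."""
    digest = hashlib.sha256()
    # start from the plain content hash that was used on its own before
    digest.update(hashlib.sha256(content).digest())

    for prop in (filename, content_type, content_disposition):
        if prop is None:
            # a missing property must differ from an empty one
            digest.update(b"\x00")
        else:
            encoded = prop.encode("utf-8")
            # the length prefix keeps ("ab", "c") apart from ("a", "bc")
            digest.update(b"\x01" + len(encoded).to_bytes(8, "big"))
            digest.update(encoded)

    return digest.hexdigest()
```

With such a key, two submissions of identical content under different
file names no longer share a cached result, an in-flight lock or a
duplicate backlog entry.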
Note: We still consider the ruleset and analyser behaviour static, i.e.
old results would still need to be invalidated whenever analyser or
ruleset behaviour changes in a manner that makes old results
inconclusive.