cmaumet opened this issue 6 years ago
+1 on this... especially considering that gzip can be called with different options (e.g. compression level) and gzip files can even carry optional comment fields, this was always rather fragile. It's annoying, but I don't see a workaround.
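To make the fragility concrete, here is a minimal Python sketch (not from the thread; sha256 is used purely as an example digest): the same payload compressed with different gzip settings produces different compressed bytes, while a checksum of the decompressed payload stays the same.

```python
# Illustration: checksums of .gz files depend on gzip settings (compression
# level, timestamp, optional name/comment fields), not just on the payload.
import gzip
import hashlib

payload = b"identical image payload" * 1000   # stand-in for the raw .nii bytes

gz_fast = gzip.compress(payload, compresslevel=1, mtime=0)
gz_best = gzip.compress(payload, compresslevel=9, mtime=0)

# Hashes of the compressed files differ...
print(hashlib.sha256(gz_fast).hexdigest() == hashlib.sha256(gz_best).hexdigest())  # False
# ...but hashes of the decompressed payload agree.
print(hashlib.sha256(gzip.decompress(gz_fast)).hexdigest()
      == hashlib.sha256(gzip.decompress(gz_best)).hexdigest())                     # True
```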
+1 on storing non-zipped sums. but in general, since a change of a single bit will change a shasum, these are not good substitutes for anything other than identity.
we have always considered more flexible hashes that match the binary blob, the header, etc. we can describe an image based on an overall hash, on the data blob being the same, on the header being the same, and so on.
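As a rough sketch of what per-component hashing could look like (illustration only, not an agreed NIDM scheme; it assumes nibabel and hashes the raw header block and the voxel array separately):

```python
# Illustration: describe one image with several hashes -- whole (uncompressed)
# file, header block, and voxel data -- so comparisons can be made per component.
import gzip
import hashlib
import nibabel as nib
import numpy as np

def nifti_hashes(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        raw = f.read()                                 # uncompressed .nii stream
    img = nib.load(path)
    header_bytes = img.header.binaryblock              # raw NIfTI header bytes
    data_bytes = np.asanyarray(img.dataobj).tobytes()  # voxel array as nibabel loads it
    return {
        "file_sha256": hashlib.sha256(raw).hexdigest(),
        "header_sha256": hashlib.sha256(header_bytes).hexdigest(),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
    }
```

With something like this, two exports of the same statistical map could still match on the data hash even if header fields or gzip settings differ.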
We discussed this on the NIDM call on March 5th.
@cmaumet - to write up a proposal on how to store the original shasum (including pros and cons).
given that shasums are bit dependent, what is the likelihood of two unzipped nifti files having the same shasum when run through the same processing, say in spm and fsl?
i.e. should we start moving towards breaking down the information content into pieces that we want to query on?
@satra: if two pipelines reused the same data?
@cmaumet - yes. i worry there are too many pieces in the nifti file that would be different.
so the only thing consistent would be at the level of the input data. and if that's the case, then the SHASUM as it stands currently would be fine for referring to the input data.
@satra - what would be your suggested update for NIDM? Creating separate entities for the header & the image data, for each file?
@cmaumet - perhaps it would be useful to know what sort of equality comparisons you are planning to make?
I had understood this was only the most superficial comparison... "is this the same file" basically, without any nuance of "is this the same data before preprocessing". If I went in and changed the NIfTI comment header field, the shasum should change. (It's a trivial change, but it's no longer the same file.)
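To make that point concrete, a small sketch (illustration only; the filename zstat1.nii is hypothetical and nibabel is assumed): editing the free-text descrip header field produces a different file, hence a different shasum, even though the voxel data are untouched.

```python
# Illustration: a trivial header edit changes the file-level checksum.
import hashlib
import nibabel as nib

img = nib.load("zstat1.nii")                  # hypothetical input file
img.header["descrip"] = b"edited comment"     # trivial metadata change only
nib.save(img, "zstat1_edited.nii")

for path in ("zstat1.nii", "zstat1_edited.nii"):
    with open(path, "rb") as f:
        print(path, hashlib.sha256(f.read()).hexdigest())  # the two digests differ
```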
@nicholst - that is correct. hence my question of what types of comparisons to make.
i used the phrase "same same but different" for an ohbm brainhack project last year, to illustrate issues with similarity. two files can be similar on the basis of:
- image similarity
- graph similarity
for this specific issue, perhaps we should be focusing on attributes directly/easily extractable from the image. we want a set of comparison attributes associated with an image. we could add new attributes to the file entity, or create a new companion entity of similarity measures, i.e. describe when two files are similar (see the sketch below).
i do think this topic is worth a good discussion. we should determine what aspects of similarity we want to capture, and what use cases these pieces of information are intended to help address.
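One way such a companion entity could be sketched, as a plain Python dict built from per-component hashes (the attribute names and values here are invented for illustration and are not NIDM terms):

```python
# Illustration only: a "similarity" companion record that could sit alongside
# the file entity, holding the pieces one might want to query on.
similarity_record = {
    "file_sha256": "<hash of the uncompressed file>",
    "header_sha256": "<hash of the header block only>",
    "data_sha256": "<hash of the voxel data only>",
    "dimensions": [91, 109, 91],        # example values
    "voxel_size_mm": [2.0, 2.0, 2.0],   # example values
}
```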
FYI - There is a similar discussion regarding the use of owl:sameAs. They point out that owl:sameAs is often used to convey "represents", "very similar to", "same thing but a different context", etc., some of which are relevant to the discussion above by @satra.
Hi everyone,
In a NIDM-Results pack:
But the shasums of the gzipped files are different:
Differences in shasum can be explained by the fact that different processes were used to gzip the images. But this defeats our initial goal of being able to identify common images across multiple NIDM graphs (for reconciliation).
As a workaround, we could additionally store the shasum of the file before compression.
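A minimal sketch of that workaround (illustration only; sha256 is used as an example digest, and the filename is hypothetical): compute the checksum on the decompressed stream rather than on the .gz file.

```python
# Illustration: hash the uncompressed NIfTI stream so that the gzip settings
# used by different exporters do not change the stored checksum.
import gzip
import hashlib

def uncompressed_sha256(path):
    with gzip.open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# e.g. uncompressed_sha256("TStatistic.nii.gz")  # hypothetical file name
```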
What are your thoughts on this?
Note: This issue was identified with @gllmflndn when testing the SPM-NIDM-Results exporter in Octave at https://github.com/incf-nidash/nidmresults-spm/pull/46 and briefly discussed on NIDM call (Jan. 29th, 2018).