incf-nidash / nidm-specs

Neuroimaging Data Model (NIDM): describing neuroimaging data and provenance
nidm.nidash.org

Shasum of gzipped files depends on use of Matlab vs. Octave & host #457

Open cmaumet opened 6 years ago

cmaumet commented 6 years ago

Hi everyone,

In a NIDM-Results pack:

But the shasums of the gzipped files are different:

The differences in shasum can be explained by the fact that different processes were used to gzip the images. But this undermines our initial goal of being able to identify common images across multiple NIDM graphs (for reconciliation).

As a workaround, we could additionally store the shasum of the file before compression.
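A minimal sketch of this workaround in Python, assuming SHA-512 as the digest (the thread only says "shasum") and a byte string standing in for a real image. The helper name `sha512_of_gzip_payload` is hypothetical. It hashes the decompressed stream, so the result should not depend on how the file was gzipped:

```python
import gzip
import hashlib
import io


def sha512_of_gzip_payload(gz_bytes: bytes) -> str:
    """Return the SHA-512 hex digest of the data stored inside a gzip stream."""
    digest = hashlib.sha512()
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb") as stream:
        # Read in 1 MiB chunks so large images need not fit in memory twice.
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for the bytes of an uncompressed image (an assumption for illustration).
payload = b"stand-in for NIfTI image bytes" * 1000

# Compress the same payload twice with a different level and timestamp,
# as two different hosts or tools might.
gz_a = gzip.compress(payload, compresslevel=1, mtime=0)
gz_b = gzip.compress(payload, compresslevel=9, mtime=1)

print("gzipped digests equal:  ",
      hashlib.sha512(gz_a).hexdigest() == hashlib.sha512(gz_b).hexdigest())
print("unzipped digests equal: ",
      sha512_of_gzip_payload(gz_a) == sha512_of_gzip_payload(gz_b))
```

Storing the second digest alongside the existing one would keep the current "is this the exact same file" check while adding a comparison that survives recompression.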

What are your thoughts on this?

Note: This issue was identified with @gllmflndn when testing the SPM-NIDM-Results exporter in Octave at https://github.com/incf-nidash/nidmresults-spm/pull/46 and briefly discussed on NIDM call (Jan. 29th, 2018).

nicholst commented 6 years ago

+1 on this... especially considering that gzip can be called with different options (e.g. compression level) and even have optional comment fields, this was always rather fragile. It's annoying, but I don't see a workaround.
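To illustrate how fragile the gzipped digest is, the sketch below keeps the compression level and timestamp fixed and changes only the file name embedded in the gzip header. The helper `gzip_with_name` and the file names are hypothetical, and SHA-512 is assumed as the digest:

```python
import gzip
import hashlib
import io


def gzip_with_name(payload: bytes, embedded_name: str) -> bytes:
    """Gzip `payload` in memory, recording `embedded_name` in the gzip header."""
    buffer = io.BytesIO()
    with gzip.GzipFile(filename=embedded_name, mode="wb", fileobj=buffer,
                       compresslevel=6, mtime=0) as stream:
        stream.write(payload)
    return buffer.getvalue()


payload = b"stand-in for NIfTI image bytes" * 1000

# Same data, same level, same timestamp: only the header file name differs.
gz_one = gzip_with_name(payload, "con_0001.nii")
gz_two = gzip_with_name(payload, "contrast.nii")

print("gzipped digests equal:",
      hashlib.sha512(gz_one).hexdigest() == hashlib.sha512(gz_two).hexdigest())
print("payloads equal:       ",
      gzip.decompress(gz_one) == gzip.decompress(gz_two))
```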

satra commented 6 years ago

+1 on storing non-zipped sums. but in general, since a change of a single bit can affect a shasum, these are not good substitutes for anything other than identity.

we have always considered more flexible hashes to match binary blob, header, etc. we can describe an image based on the overall hash, the blob being the same, the header being the same, etc.
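One way such per-part hashes could look, as a rough sketch only: it assumes an uncompressed single-file NIfTI-1 image (348-byte header, `vox_offset` stored as a float32 at byte 108), ignores any header extensions between the header and the data, and uses SHA-512. The functions `make_fake_nifti` and `split_hashes` are hypothetical, and the file is synthetic:

```python
import hashlib
import struct

NIFTI1_HEADER_SIZE = 348  # size of the fixed NIfTI-1 header in bytes
VOX_OFFSET_POS = 108      # float32 field: byte offset where image data starts
DESCRIP_POS = 148         # 80-byte free-text description field


def make_fake_nifti(descrip: bytes, data: bytes) -> bytes:
    """Build a tiny synthetic little-endian NIfTI-1 file for illustration."""
    header = bytearray(NIFTI1_HEADER_SIZE)
    struct.pack_into("<i", header, 0, NIFTI1_HEADER_SIZE)  # sizeof_hdr
    struct.pack_into("<f", header, VOX_OFFSET_POS, 352.0)  # vox_offset
    header[DESCRIP_POS:DESCRIP_POS + 80] = descrip.ljust(80, b"\0")[:80]
    header[344:348] = b"n+1\0"                             # magic string
    return bytes(header) + b"\0" * 4 + data                # 4-byte extension flag


def split_hashes(nii_bytes: bytes) -> dict:
    """Return separate digests for the whole file, the header, and the data."""
    # sizeof_hdr must read as 348 in the file's own byte order.
    for order in ("<", ">"):
        if struct.unpack_from(order + "i", nii_bytes, 0)[0] == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    vox_offset = int(struct.unpack_from(order + "f", nii_bytes, VOX_OFFSET_POS)[0])
    return {
        "file": hashlib.sha512(nii_bytes).hexdigest(),
        "header": hashlib.sha512(nii_bytes[:NIFTI1_HEADER_SIZE]).hexdigest(),
        "data": hashlib.sha512(nii_bytes[vox_offset:]).hexdigest(),
    }


voxels = bytes(range(256)) * 4
hashes_a = split_hashes(make_fake_nifti(b"written by tool A", voxels))
hashes_b = split_hashes(make_fake_nifti(b"written by tool B", voxels))

for part in ("file", "header", "data"):
    print(part, "equal:", hashes_a[part] == hashes_b[part])
```

With digests split this way, two files whose headers differ only in the description field would still match on the data digest, while the whole-file digest keeps its strict identity meaning.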

cmaumet commented 6 years ago

We discussed this on NIDM call on March 5th.

@cmaumet - to write up a proposal on how to store the original shasum (including pros and cons).

satra commented 6 years ago

given that shasums are bit dependent, what is the likelihood of two unzipped nifti files having the same shasum when run through the same processing, say in spm and fsl?

i.e. should we start moving towards breaking down the information content into pieces that we want to query on.

cmaumet commented 6 years ago

@satra: if two pipelines reused the same data?

satra commented 6 years ago

@cmaumet - yes. i worry there are too many pieces in the nifti file that would be different.

so the only thing consistent would be at the level of the input data. and if that's the case, then the SHASUM as it stands currently would be fine to refer to input data.

cmaumet commented 6 years ago

@satra - what update would you suggest for NIDM? Creating separate entities for header and image, for each file?

satra commented 6 years ago

@cmaumet - perhaps it would be useful to know what sort of equality comparisons you are planning to make?

nicholst commented 6 years ago

I had understood this was only the most superficial comparison... "is this the same file" basically, without any nuance of "Is this the same data before preprocessing". If I went in and changed the NIFTI comment header field, the shasum should change. (It's a trivial change, but it's no longer the same file.)

satra commented 6 years ago

@nicholst - that is correct. hence my question of what types of comparisons to make.

i used the phrase "same same but different" for an ohbm brainhack project last year, to illustrate issues with similarity. two files can be similar on the basis of:

image similarity

graph similarity

for this specific issue, perhaps we should be focusing on attributes directly/easily extractable from the image. we want a set of comparison attributes associated with an image. we could insert new attributes into the file, or create a new companion entity of similarity measures, i.e. one that says when two files are similar.

i do think this topic is worth a good discussion. we should determine what aspects of similarity we get:

and what use cases these pieces of information are intended to help address.

khelm commented 6 years ago

FYI - There is a similar discussion regarding the use of owl:sameAs. They point out that owl:sameAs is often used to convey "represents", "very similar to", "same thing but a different context", etc. Some of these are relevant to the points raised above by @satra.