Order of attrs fields not stable in interchange format

JGuetschow commented 10 months ago

Describe the bug

The order of the attrs fields in the interchange format is not always the same, which leads to differences when the same data is saved by different users. In the YAML file this is directly visible, but since we have also seen differences in the checksums of binary files, there might be a similar problem there.

Failing Test

No built-in test is known to be failing. We noticed this when re-reading a dataset version in the Andrew cement data repository. See this PR. @crdanielbusch, can you clone primap2 and run make test to see if anything fails for you?

Expected behavior

Dataset metadata (and actual data) should always be ordered in the same way, so that when saving with DataLad only actual data differences are detected as new, and not a reordering of metadata or data.
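To illustrate the expected behavior, here is a minimal sketch of order-independent metadata serialization. It uses the standard-library `json` module as a stand-in for the YAML writer (PyYAML's `safe_dump` accepts a `sort_keys` argument as well). The attribute names and values are illustrative only, and `stable_dump` and `checksum` are hypothetical helpers, not primap2 functions.

```python
import hashlib
import json


def stable_dump(attrs: dict) -> str:
    """Serialize metadata with keys sorted at every nesting level."""
    return json.dumps(attrs, sort_keys=True, indent=2)


def checksum(text: str) -> str:
    """SHA-256 of the serialized text, as a checksum tool would see it."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The same metadata, assembled in a different order by two hypothetical users.
attrs_user_a = {
    "area": "area (ISO3)",
    "cat": "category (IPCC2006)",
    "dimensions": {"CO2": ["area", "time"], "*": ["area", "time", "entity"]},
}
attrs_user_b = {
    "dimensions": {"*": ["area", "time", "entity"], "CO2": ["area", "time"]},
    "cat": "category (IPCC2006)",
    "area": "area (ISO3)",
}

# Without sorting, insertion order leaks into the file contents.
unsorted_equal = json.dumps(attrs_user_a) == json.dumps(attrs_user_b)

# With sorting, both users produce byte-identical output.
sorted_equal = checksum(stable_dump(attrs_user_a)) == checksum(stable_dump(attrs_user_b))

print(unsorted_equal, sorted_equal)
```

Sorting only fixes dictionary key order. Lists keep their order, so any list-valued attribute would need its own ordering rule.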

System (please complete the following information):

Original data read on Linux Mint: Python 3.10.12, pandas 1.2.1, primap2 0.9.7, xarray 2023.10.1.

Conflicting read on macOS: @crdanielbusch, can you add your package versions here?

mikapfl commented 3 months ago

In general, this might be hard to achieve for binary data due to timestamps. But we can of course add this as a requirement, start testing for it, and specify in the data format descriptions that the data needs to be ordered in a specific way and have a stable bitstream. This might, for example, limit our options for compression.

mikapfl commented 3 months ago

For example, zip files always have a timestamp, and other compression algorithms change their bitstream with newer versions of the compression libraries (usually for higher compression ratios). We'd need to do some research into actually stable binary data formats if we need to commit to bitstream stability for longer periods (years).
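The timestamp effect can be reproduced with the Python standard library alone. This is a small sketch rather than anything primap2 does: `zip_bytes` is a hypothetical helper, and the file name and payload are made up. It writes the same payload twice with different entry timestamps and once more with a pinned timestamp.

```python
import hashlib
import io
import zipfile


def zip_bytes(payload: bytes, date_time: tuple) -> bytes:
    """Build an in-memory zip archive with one entry and a given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # ZipInfo carries the modification time that is stored in the archive.
        info = zipfile.ZipInfo("data.csv", date_time=date_time)
        zf.writestr(info, payload)
    return buf.getvalue()


payload = b"area,time,value\nDEU,2020,1.0\n"

# Identical content, archived at two different points in time.
first = zip_bytes(payload, (2023, 1, 1, 12, 0, 0))
second = zip_bytes(payload, (2024, 6, 1, 8, 30, 0))

# Identical content with the timestamp pinned to a fixed value.
pinned_a = zip_bytes(payload, (1980, 1, 1, 0, 0, 0))
pinned_b = zip_bytes(payload, (1980, 1, 1, 0, 0, 0))

differs = hashlib.sha256(first).hexdigest() != hashlib.sha256(second).hexdigest()
stable = hashlib.sha256(pinned_a).hexdigest() == hashlib.sha256(pinned_b).hexdigest()
print(differs, stable)
```

Pinning the timestamp removes one source of variation. It does not address bitstream changes between versions of a compression library.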

mikapfl commented 2 days ago

I've added a solution for the interchange format in #268. For the native format, as I said, this isn't trivial because of compression libraries and other metadata. See e.g. a quest for hashing the contents of a netCDF file. Honestly, if we need to determine whether something changed, maybe it is best to compare hashes of the interchange format on-disk representation…

JGuetschow commented 2 days ago

In the UNFCCC_non-AnnexI_data repository I use hashes of the data in memory to determine if there are changes. The hash is stored in the filename, so I can compare to the hash of the old data without loading and rehashing it.
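A minimal sketch of the hash-in-filename idea, not the actual code from that repository. All function names, the `stem_hash.csv` naming pattern, and the list-of-dicts data layout are assumptions made for illustration. Rows are put into a canonical order before hashing, so a pure reordering does not count as a change.

```python
import hashlib
import json
from pathlib import Path


def content_hash(records: list, length: int = 12) -> str:
    """Hash in-memory records independently of row and key order."""
    # Canonical form: each row serialized with sorted keys, then rows sorted.
    rows = sorted(json.dumps(r, sort_keys=True, separators=(",", ":")) for r in records)
    digest = hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()
    return digest[:length]


def filename_with_hash(stem: str, records: list) -> str:
    """Embed the content hash in the file name (hypothetical naming pattern)."""
    return f"{stem}_{content_hash(records)}.csv"


def hash_from_filename(name: str) -> str:
    """Recover the stored hash without opening or loading the old file."""
    return Path(name).stem.rsplit("_", 1)[1]


def has_changed(old_filename: str, new_records: list) -> bool:
    """Compare new in-memory data against the hash stored in the old name."""
    return hash_from_filename(old_filename) != content_hash(new_records)


rows = [
    {"area": "DEU", "time": 2020, "value": 1.0},
    {"area": "FRA", "time": 2020, "value": 2.0},
]
old_name = filename_with_hash("inventory", rows)

reordered = list(reversed(rows))
updated = rows + [{"area": "ITA", "time": 2020, "value": 3.0}]

print(has_changed(old_name, reordered), has_changed(old_name, updated))
```

Truncating the digest keeps file names short at the cost of a slightly higher collision risk. The length used here is arbitrary.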

JGuetschow commented 2 days ago

As it's fixed for the interchange format, I think we can close this now. Maybe we can add a sentence to the docs mentioning the instability of hashes for the netCDF format.

primap-community / primap2

Order of attrs fields not stable in interchange format #184