pik-primap / primap2

The next generation of the PRIMAP climate policy analysis suite
https://primap2.readthedocs.io
Apache License 2.0

Order of attrs fields not stable in interchange format #184

Open JGuetschow opened 10 months ago

JGuetschow commented 10 months ago

Describe the bug

The order of the attrs fields in the interchange format is not always the same, which leads to differences when the same data is saved by different users. This is directly visible in the YAML file, but as we have also seen differences in the checksums of binary files, there might be a similar problem there.

Failing Test

No built-in test is known to be failing. We noticed this when re-reading a dataset version in the Andrew cement data repository. See this PR. @crdanielbusch, can you clone primap2 and run make test to see if anything fails for you?

Expected behavior

Dataset metadata (and actual data) should always be ordered in the same way, so that when saving with DataLad only actual data differences are detected as new, and not a reordering of metadata or data.
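To illustrate the expected behavior, here is a minimal sketch using only the Python standard library. It is not primap2's actual code: the helper names `canonical_attrs` and `checksum` and the example attrs keys are hypothetical, and JSON stands in for the YAML file of the interchange format. It shows that two dicts with the same content but a different key order serialize differently, and that sorting the keys before writing makes the checksum stable.

```python
import hashlib
import json


def canonical_attrs(attrs):
    """Return a copy with dict keys sorted recursively (hypothetical helper).

    Assumes all dict keys are strings. List order is kept as-is, because
    the order of list entries may carry meaning.
    """
    if isinstance(attrs, dict):
        return {key: canonical_attrs(attrs[key]) for key in sorted(attrs)}
    if isinstance(attrs, (list, tuple)):
        return [canonical_attrs(item) for item in attrs]
    return attrs


def checksum(attrs):
    """SHA-256 of the serialized metadata; json.dumps keeps dict order."""
    text = json.dumps(attrs)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same (illustrative) metadata, written in a different order by two users.
attrs_user_a = {"title": "example", "institution": "x", "references": "y"}
attrs_user_b = {"references": "y", "title": "example", "institution": "x"}

# Without a fixed order, the serialized bytes differ.
unordered_equal = checksum(attrs_user_a) == checksum(attrs_user_b)

# With sorted keys, both users produce identical bytes.
ordered_equal = checksum(canonical_attrs(attrs_user_a)) == checksum(
    canonical_attrs(attrs_user_b)
)
```

A test along these lines could be added to the test suite so that a reordering of metadata shows up as a failure instead of as a spurious DataLad diff.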

System (please complete the following information):

Original data read on: Linux Mint, Python 3.10.12, pandas 1.2.1, primap2 0.9.7, xarray 2023.10.1

Conflicting read on: macOS. @crdanielbusch, can you add your package versions here?

mikapfl commented 3 months ago

In general, this might be hard to achieve for binary data due to timestamps. But we can of course add this as a requirement, start testing for it, and specify in the data format descriptions that the data needs to be ordered in a specific way and have a stable bitstream. This might, for example, limit our options for compression.

mikapfl commented 3 months ago

For example, ZIP files always have a timestamp, and other compression algorithms change their bitstream with newer versions of the compression libraries (usually for higher compression ratios). We'd need to do some research into actually stable binary data formats if we need to commit to bitstream stability for longer periods (years).
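The timestamp effect can be reproduced with Python's standard `zipfile` module. This is only a sketch of the general problem, not of any format primap2 uses: the helper name `zip_bytes` and the member name `data.csv` are made up for the example, and `ZIP_STORED` (no compression) is chosen so that the result does not depend on the version of a compression library.

```python
import io
import zipfile


def zip_bytes(payload, date_time):
    """Build an in-memory ZIP archive with one member (hypothetical helper).

    date_time is the (year, month, day, hour, minute, second) tuple that
    is stored in the archive for the member.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
        info = zipfile.ZipInfo("data.csv", date_time=date_time)
        zf.writestr(info, payload)
    return buffer.getvalue()


payload = b"year,value\n2020,1.0\n"

# Identical payload, different timestamps: the archives differ.
archive_2020 = zip_bytes(payload, (2020, 1, 1, 0, 0, 0))
archive_2021 = zip_bytes(payload, (2021, 1, 1, 0, 0, 0))
timestamps_change_bytes = archive_2020 != archive_2021

# Pinning the timestamp to a fixed value gives identical archives.
archive_again = zip_bytes(payload, (2020, 1, 1, 0, 0, 0))
pinned_is_stable = archive_2020 == archive_again
```

Pinning the timestamp only addresses one source of instability on one machine. It does not cover the second point above, that compressed bitstreams can change between library versions.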