mamba-org / conda-specs

A collection of specs and spec proposals for Mamba packages, recipes and repositories

[reproducibility] Versioning the repodata #3

Open dhirschfeld opened 4 years ago

dhirschfeld commented 4 years ago

The solution conda/mamba comes up with depends on the universe of packages in the repodata at solve time. This means that installing from the same specs at different times can produce different environments.

The proper solution to this is to export the explicit specs at the time you create the environment so that it can be recreated exactly in future.

Versioning the repodata can help in cases where the analyst didn't export the explicit specs for their environment. In this case, to reproduce their results, it could help to be able to specify an as_of_timestamp and solve for the environment given the state of the repodata as of that time.

dhirschfeld commented 4 years ago

Prior Art:

To achieve similar reproducibility goals, MRO (Microsoft R Open) snapshots the entire CRAN universe every day. Where you control the repository, you can instead simply associate an inserted/uploaded timestamp with the package data and filter on that to present the state of the repository at any given time.

ref: https://mran.microsoft.com/documents/rro/reproducibility

wolfv commented 4 years ago

Hi @dhirschfeld, thanks for the comment! Indeed, this is already possible with what we have today, because each package has a timestamp in the repodata.json. We could simply filter out all packages with a timestamp > 123 to get to the repodata state as of 123. It will require some work in mamba to make it happen, though.
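A minimal sketch of that filtering idea, written outside of mamba and not reflecting any existing mamba or conda API. It assumes that repodata.json keeps its records under the "packages" and "packages.conda" keys and that each record's "timestamp" is in milliseconds since the Unix epoch. The function name filter_repodata and the choice to drop records that have no timestamp are illustrative assumptions.

```python
from datetime import datetime, timezone


def filter_repodata(repodata: dict, as_of: datetime) -> dict:
    """Return a shallow copy of repodata without records newer than as_of.

    Hypothetical helper, not part of mamba or conda.
    as_of should be timezone-aware so the cutoff is unambiguous.
    """
    # Assumption: record timestamps are milliseconds since the Unix epoch.
    cutoff_ms = as_of.timestamp() * 1000

    filtered = dict(repodata)
    for section in ("packages", "packages.conda"):
        records = repodata.get(section, {})
        filtered[section] = {
            filename: record
            for filename, record in records.items()
            # Assumption: a record without a timestamp cannot be dated,
            # so it is dropped rather than guessed at.
            if "timestamp" in record and record["timestamp"] <= cutoff_ms
        }
    return filtered


# Made-up sample data for illustration only.
sample = {
    "info": {"subdir": "linux-64"},
    "packages": {
        "a-1.0-0.tar.bz2": {"name": "a", "version": "1.0", "timestamp": 1577836799999},
        "a-2.0-0.tar.bz2": {"name": "a", "version": "2.0", "timestamp": 1577836800001},
        "b-1.0-0.tar.bz2": {"name": "b", "version": "1.0"},
    },
    "packages.conda": {},
}

as_of = datetime(2020, 1, 1, tzinfo=timezone.utc)
snapshot = filter_repodata(sample, as_of)
print(sorted(snapshot["packages"]))
```

The solver would then be pointed at the filtered index instead of the live one. Dropping undated records is the conservative choice here; keeping them would be the other reasonable option.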

dhirschfeld commented 4 years ago

IIUC that's the build timestamp of the package? If you assume that all packages are built by CI and uploaded immediately afterwards, then it could stand in as a proxy for when the package became available for a client (mamba/conda) to query/download. That assumption doesn't always hold, though.
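A small illustration of where the proxy breaks down, using made-up numbers and the cutoff of 123 from earlier in the thread. The Record class and its uploaded_at field are hypothetical: the thread only establishes that the repodata carries a build timestamp.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str
    built_at: int     # build timestamp, the value the repodata carries
    uploaded_at: int  # hypothetical upload time, not a known repodata field


def visible_by_build_time(records, as_of):
    """Names of packages whose build timestamp is at or before as_of."""
    return [r.name for r in records if r.built_at <= as_of]


def visible_by_upload_time(records, as_of):
    """Names of packages that had actually been uploaded by as_of."""
    return [r.name for r in records if r.uploaded_at <= as_of]


records = [
    Record("pkg-1.0", built_at=100, uploaded_at=101),  # uploaded right after the build
    Record("pkg-1.1", built_at=110, uploaded_at=150),  # built early, uploaded late
]

by_build = visible_by_build_time(records, as_of=123)
by_upload = visible_by_upload_time(records, as_of=123)
print(by_build)
print(by_upload)
```

Filtering on the build timestamp includes pkg-1.1 in the snapshot as of 123, even though no client could have downloaded it until 150. A solve against that snapshot could therefore pick a package the original environment never saw.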