Open dhirschfeld opened 4 years ago
To achieve similar reproducibility goals, MRO snapshots the entire CRAN universe every day. Where you control the repository, you can instead simply associate an inserted/uploaded timestamp with the package data and filter on that to present the state of the repository at any given time.
ref: https://mran.microsoft.com/documents/rro/reproducibility
Hi @dhirschfeld thanks for the comment!
Indeed, this is already possible with what we have today, because each package has a timestamp in the repodata.json. We could simply filter out all packages with a timestamp > `123` to get to the repodata state as of `123`.
It will require some work in mamba to make it happen, though.
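To make the filtering idea concrete, here is a rough sketch in plain Python, not how mamba does or would implement it. It assumes the usual repodata.json layout with `packages` and `packages.conda` maps keyed by filename, and that `as_of_timestamp` is given in the same unit as each record's `timestamp` field. The function name `repodata_as_of` and the `keep_untimestamped` flag are made up for illustration.

```python
def repodata_as_of(repodata, as_of_timestamp, keep_untimestamped=False):
    """Return a copy of `repodata` without records newer than `as_of_timestamp`.

    Hypothetical helper: records lacking a "timestamp" field are dropped
    unless `keep_untimestamped` is True, since their age is unknown.
    """

    def is_visible(record):
        timestamp = record.get("timestamp")
        if timestamp is None:
            return keep_untimestamped
        return timestamp <= as_of_timestamp

    # Shallow copy so top-level keys such as "info" carry over unchanged.
    filtered = dict(repodata)
    for key in ("packages", "packages.conda"):
        records = repodata.get(key, {})
        filtered[key] = {
            filename: record
            for filename, record in records.items()
            if is_visible(record)
        }
    return filtered


# Toy data for illustration only; real records carry many more fields.
example = {
    "info": {"subdir": "linux-64"},
    "packages": {
        "a-1.0-0.tar.bz2": {"name": "a", "version": "1.0", "timestamp": 100},
        "a-2.0-0.tar.bz2": {"name": "a", "version": "2.0", "timestamp": 200},
        "b-1.0-0.tar.bz2": {"name": "b", "version": "1.0"},
    },
    "packages.conda": {},
}

snapshot = repodata_as_of(example, 123)
print(sorted(snapshot["packages"]))  # ['a-1.0-0.tar.bz2']
```

Dropping records with no timestamp is the conservative default here; whether that is the right choice for older packages is an open question.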
IIUC that's the build timestamp of the package? If you make the assumption that all packages are built by CI and uploaded immediately afterwards, then it could stand in as a proxy for when the package became available for a client (mamba/conda) to query/download. That assumption doesn't always hold, though.
The solution conda/mamba comes up with is dependent on the universe of packages in the `repodata`. This means that installations at different times can potentially produce different environments. The proper solution to this is to export the explicit specs at the time you create the environment so that it can be recreated exactly in future.
Versioning the repodata can help in cases where the analyst didn't export the explicit specs for their environment. In this case, to reproduce their results it could help to be able to specify an `as_of_timestamp` to solve for the environment given the state of the `repodata` as of the specified time.