polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

Generate a deterministic checksum for `DatasetV2` Zarr manifests #188

Open · Andrewq11 opened 2 months ago

Andrewq11 commented 2 months ago

Context

The DatasetV2 class is the second generation of datasets on Polaris, purpose-built to handle datasets of unlimited size. This generation relies heavily on Zarr to chunk a dataset into more manageable pieces.
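For readers unfamiliar with Zarr, here is a toy example of what this chunked layout looks like (zarr-python v2 style; the path and shapes are made up for illustration):

```python
import numpy as np
import zarr

# A Zarr array is persisted as many small chunk files on disk, so an
# arbitrarily large dataset never has to fit in memory all at once.
z = zarr.open("example.zarr", mode="w", shape=(100_000, 128),
              chunks=(1_000, 128), dtype="f4")

# Writing a slice only touches the chunk files that overlap it.
z[:1_000] = np.random.rand(1_000, 128).astype("f4")
```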

With DatasetV2, a new mechanism was implemented for guaranteeing the validity of datasets. This mechanism involves creating a PyArrow table with the following schema and saving it to disk as a Parquet file:

| Path | Checksum |
|------|----------|
| str  | str      |

Once saved to disk, the md5 hash of the file is generated to produce the checksum for the Zarr manifest of the dataset. This checksum is not used to check the validity of the dataset itself; it is only used to ensure the manifest file is intact when pulled into the Polaris Hub. Once the integrity of the manifest is confirmed, the individual chunk checksums are used to guarantee the validity of each dataset chunk (and thus, the dataset).
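To make the mechanism concrete, here is a minimal sketch of the idea; `build_manifest` and its signature are hypothetical names for illustration, not the actual Polaris implementation:

```python
import hashlib
import os

import pyarrow as pa
import pyarrow.parquet as pq

def build_manifest(zarr_root: str, manifest_path: str) -> str:
    """Hash every chunk file in a Zarr archive, write a Path/Checksum
    Parquet manifest, and return the md5 of the manifest file itself."""
    paths, checksums = [], []
    # Note: os.walk is built on os.scandir and inherits its arbitrary
    # ordering, which is exactly the problem described below.
    for dirpath, _, filenames in os.walk(zarr_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            paths.append(os.path.relpath(full, zarr_root))
            checksums.append(digest)

    table = pa.table({"Path": paths, "Checksum": checksums})
    pq.write_table(table, manifest_path)

    # The dataset-level checksum is the md5 of the manifest file.
    with open(manifest_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
```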

Description

We currently use os.scandir to recurse through the Zarr archive and build the PyArrow table described above. Because os.scandir yields directory entries in an arbitrary, filesystem-dependent order, each walk of the same Zarr dataset produces a different PyArrow table. This ultimately results in a different Parquet file on disk, and therefore a different md5 hash, for the same dataset.
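In simplified form, the problematic traversal looks roughly like this (`walk_zarr` is an illustrative name, not the actual code):

```python
import os
from typing import Iterator

def walk_zarr(root: str) -> Iterator[str]:
    # os.scandir yields entries in an arbitrary, filesystem-dependent
    # order, so two walks of the same archive can emit different orders.
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_zarr(entry.path)
        else:
            yield entry.path
```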

We should devise a way to deterministically walk a Zarr archive such that the same Zarr manifest is produced for the same dataset. This method must use Python generators to avoid excessive memory usage, given the potential size of V2 datasets.
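One possible approach (a sketch under the constraints above, not a committed design) is to sort the entries of each directory before recursing, which fixes the order while the generator still yields paths lazily:

```python
import os
from typing import Iterator

def walk_zarr_deterministic(root: str) -> Iterator[str]:
    # Sorting each directory listing by name makes the traversal order
    # stable across runs and platforms; memory usage stays bounded by
    # a single directory listing rather than the whole archive.
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_zarr_deterministic(entry.path)
        else:
            yield entry.path
```

Lexicographic order is arbitrary but consistent, which is all the manifest needs.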

Acceptance Criteria

Links

hmacdope commented 1 month ago

Just to duplicate the conversation from the PR in a central place: it might be worth investigating https://github.com/dandi/zarr_checksum as a possible solution to this problem.
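If I remember the README correctly, usage is roughly along these lines (worth double-checking against the current API):

```python
from zarr_checksum import compute_zarr_checksum
from zarr_checksum.generators import yield_files_local

# Aggregates per-file checksums over a local Zarr archive
# into a single checksum for the whole archive.
checksum = compute_zarr_checksum(yield_files_local("path/to/archive.zarr"))
```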

cwognum commented 1 month ago

Hey @hmacdope, good to see you here! 😄

And thanks for taking a look at this issue!

The implementation of the Dataset V1 checksum (see e.g. here) is based on zarr-checksum; see e.g. the licensing statement at the top. As we started working towards a V2 of the Dataset implementation (specifically aimed at supporting XL datasets), we realized this approach wouldn't scale, because it assumes the full list of files can be kept in memory.

The main purpose of the checksum is to verify completeness and integrity on upload. We want to make sure that all files the user created locally make their way to the Hub. For that purpose, the checksum doesn't necessarily have to be deterministic. I do think there could be other use cases in which a deterministic checksum is beneficial (hence this issue), but exactly what those use cases are is less clear right now.

Hope that provides some context!