polaris-hub / polaris

Foster the development of impactful AI models in drug discovery.
https://polaris-hub.github.io/polaris/
Apache License 2.0

Generate a deterministic checksum for `DatasetV2` Zarr manifests #188

Open Andrewq11 opened 3 weeks ago

Andrewq11 commented 3 weeks ago

Context

The DatasetV2 class is the second generation of datasets on Polaris, purpose-built to handle datasets of unlimited size. It relies heavily on Zarr to chunk a dataset into more manageable pieces.

With DatasetV2, a new mechanism was implemented for guaranteeing the validity of datasets. This mechanism involves creating a PyArrow table with the following schema and saving it to disk as a Parquet file:

| Path | Checksum |
| --- | --- |
| str | str |

Once the file is saved to disk, its MD5 hash is computed to produce the checksum for the dataset's Zarr manifest. This checksum is not used to check the validity of the dataset itself; it only ensures that the manifest file is intact when it is pulled in on the Polaris Hub. Once the integrity of the manifest is confirmed, the individual chunk checksums it contains are used to guarantee the validity of each dataset chunk (and thus of the dataset).
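For illustration, the manifest integrity check described above could look roughly like the sketch below. The helper names `md5_of_file` and `manifest_is_intact` are hypothetical and not part of the Polaris API; the file is read in blocks so that a large manifest never has to be loaded into memory at once.

```python
import hashlib


def md5_of_file(path: str, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it block by block."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns b"" (end of file).
        for block in iter(lambda: f.read(block_size), b""):
            md5.update(block)
    return md5.hexdigest()


def manifest_is_intact(path: str, expected_md5: str) -> bool:
    """Hypothetical check: compare the manifest's MD5 with the recorded one."""
    return md5_of_file(path) == expected_md5
```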

Description

We currently use os.scandir to recurse through the Zarr archive and build the PyArrow table described above. Because os.scandir yields directory entries in arbitrary order, the rows of the table can come out in a different order on each walk of the same Zarr dataset. This in turn produces a different Parquet file on disk, and therefore a different MD5 hash, for the same dataset.
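A small sketch of why row order matters. It hashes a plain-text serialization of (path, checksum) rows as a stand-in for the Parquet file, which is an assumption made to keep the example dependency-free; the paths and checksums are made up.

```python
import hashlib
from typing import Iterable, Tuple


def digest_rows(rows: Iterable[Tuple[str, str]]) -> str:
    """Hash the rows in the order given (stand-in for hashing the Parquet file)."""
    md5 = hashlib.md5()
    for path, checksum in rows:
        md5.update(f"{path}\t{checksum}\n".encode("utf-8"))
    return md5.hexdigest()


# Hypothetical manifest rows, listed in two different walk orders.
rows = [("A/0", "aaa"), ("A/1", "bbb"), ("B/0", "ccc")]
shuffled = [rows[2], rows[0], rows[1]]

# Same content, different order: the digests differ.
assert digest_rows(rows) != digest_rows(shuffled)
# Imposing one fixed order before hashing makes the digest reproducible.
assert digest_rows(sorted(shuffled)) == digest_rows(rows)
```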

We should devise a way to walk a Zarr archive deterministically, so that the same Zarr manifest is produced for the same dataset. This method must use Python generators to avoid excessive memory usage, given the potential size of V2 datasets.
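One possible approach, sketched below, is to keep os.scandir but sort the entries of each directory by name before descending, while still yielding rows lazily from a generator. This is only a sketch: the function names `iter_files_sorted` and `iter_manifest_rows` are hypothetical, and the use of MD5 for the per-file checksum is an assumption. Note the trade-off that sorting holds the entries of one directory in memory at a time, which may matter for Zarr arrays that store very many chunks in a single directory.

```python
import hashlib
import os
from typing import Iterator, Tuple


def iter_files_sorted(root: str) -> Iterator[str]:
    """Yield paths of all files under `root`, relative to `root`, in a stable order."""

    def _walk(directory: str) -> Iterator[str]:
        with os.scandir(directory) as it:
            # Sorting by name removes the dependence on the file system's listing order.
            entries = sorted(it, key=lambda e: e.name)
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                yield from _walk(entry.path)
            elif entry.is_file(follow_symlinks=False):
                # Use "/" everywhere so the same archive gives the same rows on every OS.
                yield os.path.relpath(entry.path, root).replace(os.sep, "/")

    yield from _walk(root)


def iter_manifest_rows(root: str) -> Iterator[Tuple[str, str]]:
    """Lazily yield (path, checksum) rows in the deterministic walk order."""
    for rel_path in iter_files_sorted(root):
        md5 = hashlib.md5()
        with open(os.path.join(root, rel_path), "rb") as f:
            for block in iter(lambda: f.read(1 << 20), b""):
                md5.update(block)
        yield rel_path, md5.hexdigest()
```

The rows could then be written to the Parquet file in batches, so that the full table never needs to be materialized in memory.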

Acceptance Criteria

Links