Open Andrewq11 opened 2 months ago
Just to duplicate the conversation from the PR in a central place: it might be worth investigating https://github.com/dandi/zarr_checksum as a possible solution to this problem.
Hey @hmacdope, good to see you here! 😄
And thanks for taking a look at this issue!
The implementation for the Dataset V1 checksum (see e.g. here) is based on zarr-checksum; see e.g. the licensing statement at the top. As we started working towards a V2 of the Dataset implementation (specifically aimed at supporting XL datasets), we realized this approach wouldn't scale because it assumes the full list of files can be kept in memory.
The main purpose of the checksum is to verify completeness and integrity on upload. We want to make sure that all files the user created locally make their way to the Hub. For this purpose, the checksum doesn't necessarily have to be deterministic. I do think there could be other use cases in which having a deterministic checksum is beneficial (hence this issue), but what those use cases are is less clear as of now.
Hope that provides some context!
Context
The `DatasetV2` class is the second generation of datasets on Polaris, purpose-built to handle datasets of unlimited size. This second generation relies heavily on Zarr to chunk the dataset into more manageable pieces.

With `DatasetV2`, a new mechanism was implemented for guaranteeing the validity of datasets. This mechanism involves creating a PyArrow table with the following schema and saving it to disk as a Parquet file:

Once saved to disk, the md5 hash of the file is generated to produce the checksum for the Zarr manifest of the dataset. This checksum is not used to check the validity of the dataset itself; it is only used to ensure the manifest file is intact when pulled in on the Polaris Hub. Once the integrity of the manifest is confirmed, the individual chunk checksums are used to guarantee the validity of each dataset chunk (and thus, of the dataset).
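For illustration, here is a minimal sketch of that mechanism, assuming a simple two-column manifest of chunk paths and chunk checksums (the actual column names and schema may differ, and this simplified version materializes all entries in memory, which is exactly the constraint the V2 work aims to avoid):

```python
# Hypothetical sketch of the manifest mechanism; "path" and "checksum"
# are placeholder column names, not necessarily the real schema.
import hashlib

import pyarrow as pa
import pyarrow.parquet as pq


def write_manifest_and_checksum(entries, manifest_path):
    """Write an iterable of (chunk_path, chunk_checksum) pairs to Parquet
    and return the md5 hash of the resulting file."""
    paths, checksums = zip(*entries)
    table = pa.table({"path": list(paths), "checksum": list(checksums)})
    pq.write_table(table, manifest_path)

    # Hash the Parquet file in fixed-size blocks to keep memory usage flat.
    md5 = hashlib.md5()
    with open(manifest_path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            md5.update(block)
    return md5.hexdigest()
```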
Description
We currently use `os.scandir` to recurse through the Zarr archive and build the PyArrow table described above. Because `os.scandir` does not scan a directory in a deterministic order, we produce different PyArrow tables on different walks of the same Zarr dataset. This ultimately produces a different Parquet file on disk and a different md5 hash for the same dataset.

We should devise a way to deterministically walk a Zarr archive such that the same Zarr manifest is produced for the same dataset. This method must utilize Python generators to prevent excess memory usage given the potential size of V2 datasets.
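One possible direction (a sketch, not the implemented solution): keep using `os.scandir`, but sort the entries at each directory level and yield paths lazily, so the walk order is deterministic while only one directory's entries are ever buffered at a time.

```python
# Sketch of a deterministic, generator-based walk. Assumes the Zarr
# archive lives on a local filesystem path; a real implementation would
# need to cover whichever Zarr stores DatasetV2 supports.
import os
from collections.abc import Iterator


def deterministic_walk(root: str) -> Iterator[str]:
    """Yield file paths under `root` in a stable (sorted) order."""
    with os.scandir(root) as it:
        # Only the entries of the current directory are held in memory.
        entries = sorted(it, key=lambda e: e.name)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            yield from deterministic_walk(entry.path)
        else:
            yield entry.path
```

Because the order depends only on entry names, repeated walks of the same Zarr archive would produce the same manifest rows in the same order, and therefore the same Parquet file and the same md5 hash.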
Acceptance Criteria
Links