Context
The DatasetV2 class is the second generation of datasets on Polaris, purpose-built to handle datasets of unlimited size. This second generation relies heavily on Zarr to chunk each dataset into more manageable pieces.
With DatasetV2, a new mechanism was introduced to guarantee the validity of datasets. This mechanism involves creating a PyArrow table with the following schema and saving it to disk as a Parquet file:
| Path | Checksum |
| --- | --- |
| str | str |
Once the table is saved to disk, an md5 hash of the file is computed to produce the checksum for the dataset's Zarr manifest. This checksum is not used to check the validity of the dataset itself; it only ensures the manifest file is intact when pulled in on the Polaris Hub. Once the integrity of the manifest is confirmed, the individual chunk checksums are used to guarantee the validity of each dataset chunk (and thus the dataset as a whole).
Description
We currently use os.scandir to recurse through the Zarr archive and build the PyArrow table described above. Because os.scandir does not return directory entries in a deterministic order, each walk of the same Zarr dataset can produce a different PyArrow table. This in turn produces a different Parquet file on disk, and a different md5 hash, for the same dataset.
We should devise a way to walk a Zarr archive deterministically, so that the same Zarr manifest is produced for the same dataset. This method must use Python generators to prevent excess memory usage, given the potential size of V2 datasets.
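One way to satisfy both requirements is a recursive generator that sorts each directory's entries by name before descending. The function below is an illustrative sketch, not an existing Polaris API; it materializes only one directory's entries at a time.

```python
import hashlib
import os
from typing import Iterator, Optional, Tuple


def walk_zarr_archive(root: str, _base: Optional[str] = None) -> Iterator[Tuple[str, str]]:
    """Deterministically yield (relative path, md5 checksum) for every file
    under ``root``, lazily, without building the full file list in memory."""
    base = _base if _base is not None else root
    # os.scandir yields entries in arbitrary order, so sort each directory's
    # entries by name to make every walk of the same archive identical.
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            yield from walk_zarr_archive(entry.path, base)
        else:
            with open(entry.path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            yield os.path.relpath(entry.path, base), digest
```

Because the rows arrive in a stable order, feeding them into the manifest table yields byte-identical Parquet output, and therefore an identical manifest checksum, on every run.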
Acceptance Criteria
- The Zarr manifest should include all files within the archive and their associated checksums
- The Zarr manifest checksum is the same when generated numerous times for the same dataset