sgkit-dev / vcf-zarr-publication

Manuscript and associated scripts for vcf-zarr publication

Comments from Tom #70

Closed jeromekelleher closed 5 months ago

jeromekelleher commented 6 months ago

Comments via email from @tomwhite (so I don't forget)

"By providing programmatic access, the data can be retrieved from storage, decoded and then analysed in the same memory space, avoiding the need for copying the decoded data and inter-process communication." https://github.com/sgkit-dev/vcf-zarr-publication/blob/e8d7fdbb3522c647c9fa17523b7d0343960aeb3e/paper.tex#L205 I would add "between Unix pipelines" to the end of this sentence, as I wasn't sure what IPC was being referred to until I read the rest of the section.

Discussion

"popular Zarr standard" https://github.com/sgkit-dev/vcf-zarr-publication/blob/e8d7fdbb3522c647c9fa17523b7d0343960aeb3e/paper.tex#L641C1-L641C22 It would be good to add a list of scientific projects that use Zarr, e.g. Pangeo, OME-Zarr, scverse.

"Such parallel write access is very difficult in the single-file setting, and one of the key weaknesses of the HDF5 format" https://github.com/sgkit-dev/vcf-zarr-publication/blob/e8d7fdbb3522c647c9fa17523b7d0343960aeb3e/paper.tex#L659-L661 True for writes, but for reading, Kerchunk has been quite successful in earth sciences for reading HDF5 datasets. Worth mentioning Kerchunk somewhere and saying why it doesn't work well for VCF/BCF? (Basically: HDF5 is chunk-oriented, VCF is row-oriented)
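To make the chunk-oriented vs row-oriented point concrete, here is a toy sketch. It is not the Kerchunk API; the byte strings, keys and helper names are all made up for illustration. It only assumes that a reference-based reader works by recording an (offset, length) pair for each chunk in an existing file.

```python
# Chunk-oriented layout: each chunk of an array is one contiguous byte range,
# so an external index of (offset, length) pairs is enough to read it in place.
chunked_file = b"\x01\x02\x03\x04" + b"\x05\x06\x07\x08"
chunk_refs = {"genotypes/0": (0, 4), "genotypes/1": (4, 4)}  # hypothetical keys


def read_chunk(key):
    """Fetch one chunk using only its recorded byte range."""
    offset, length = chunk_refs[key]
    return chunked_file[offset:offset + length]


# Row-oriented layout: every record holds all fields for one variant, so the
# values of a single field are scattered across the whole file.
row_file = b"chr1\t100\tA\tT\nchr1\t200\tG\tC\nchr1\t300\tT\tA\n"


def read_column(column):
    """Extract one field; no byte range covers it, so every row is parsed."""
    return [line.split(b"\t")[column] for line in row_file.splitlines()]


first_chunk = read_chunk("genotypes/0")
positions = read_column(1)
```

In the first case a byte-range index is sufficient; in the second, there is no contiguous range to point at for a single field, which is the gist of why the approach doesn't carry over to VCF.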

"Zarr provides pragmatic solutions to some of the more pressing problems facing the analysis of large-scale genetic variation data, but it is not a solution to all problems. ... dataset is basically static" https://github.com/sgkit-dev/vcf-zarr-publication/blob/e8d7fdbb3522c647c9fa17523b7d0343960aeb3e/paper.tex#L704C1-L707C40 True, but not a fundamental limitation of the spec. Arraylake (https://docs.earthmover.io/) allows updates, for example.

Methods

"Zarr is best seen as a specification that describes the details of how chunks are addressed and stored: there are multiple implementations across different languages." https://github.com/sgkit-dev/vcf-zarr-publication/blob/e8d7fdbb3522c647c9fa17523b7d0343960aeb3e/paper.tex#L842-L845 List them: Python, C, C++, Rust, JavaScript and Java according to https://zarr.dev/
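As an aside, "how chunks are addressed" could be illustrated with something like the sketch below. It assumes the Zarr v2 convention of naming a chunk by its chunk-grid indices joined with a "." separator; the function name and the array and chunk shapes are hypothetical, and this is plain Python rather than any Zarr library call.

```python
def chunk_key(element_index, chunk_shape, separator="."):
    """Return the key of the chunk that holds the element at element_index.

    Assumes a regular chunk grid: the grid coordinate along each dimension
    is the element coordinate integer-divided by the chunk size.
    """
    grid_index = [i // c for i, c in zip(element_index, chunk_shape)]
    return separator.join(str(g) for g in grid_index)


# Hypothetical genotype array chunked as 10,000 variants x 1,000 samples.
key = chunk_key((25_000, 1_500), (10_000, 1_000))
print(key)  # 2.1
```

Because the mapping from array coordinates to chunk keys is this simple, any language that can compute it and read the corresponding object from a store can implement the specification.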