sgkit-dev / vcf-zarr-publication

Manuscript and associated scripts for vcf-zarr publication

Suggestion for expanding the "Extracting fields" section #141

Closed shz9 closed 3 months ago

shz9 commented 3 months ago

I think this section would likely attract interest from people working with, maintaining, or creating large-scale VCF files, so it would be a good idea to make the benefits even clearer to the reader.

Here's my suggestion for a re-write:


\subsection{Extracting, updating, and adding fields}

We have focused on the genotype matrix up to this point, contrasting Zarr with existing row-wise methods. 
Real-world VCFs encapsulate much more than just the genotype matrix, and can contain large numbers 
of additional fields, including many variant-level annotations that may be extensively used in a variety 
of downstream tasks, such as filtering and quality control. For instance, common workflows include 
adding or updating variant or contig IDs based on reference databases, e.g. dbSNP \cite{Sherry2001dbSNP}. 
For VCFs with large sample sizes, this step can be extremely slow and wasteful of storage resources when 
carried out with standard pipelines like \texttt{bcftools annotate}, even though the information being 
updated does not necessarily pertain to or depend on the large-scale genotype matrices. The columnar 
format that we propose here is well-suited for these tasks, since it decouples variant-level information 
from the storage-intensive call-level fields. To illustrate the potential benefits, Fig~\ref{fig-column-extract} 
shows the time required to extract the genomic position of each variant in the simulated benchmark dataset, 
which we can use as an indicative example of a per-variant query. Although Savvy is many times 
faster than \texttt{bcftools query} here, the row-wise storage strategy that they share means 
that the entire dataset must be read into memory and decompressed to extract just one field 
from each record. Zarr excels at these tasks: we only read and decompress the information required.
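
To make the row-wise versus columnar argument concrete, here is a minimal toy sketch using only the Python standard library. It is not Zarr, Savvy, or bcftools, and the record layout, block size, and function names are all assumptions made up for illustration. It only counts how many bytes must be decompressed to recover the positions under each layout.

```python
import struct
import zlib

# Toy data (hypothetical): 1,000 variants and 500 samples, with one int32
# position per variant and one genotype byte per sample per variant.
NUM_VARIANTS, NUM_SAMPLES = 1_000, 500
positions = [100 + 7 * j for j in range(NUM_VARIANTS)]
genotypes = [
    bytes((j + k) % 3 for k in range(NUM_SAMPLES)) for j in range(NUM_VARIANTS)
]

# Row-wise layout: each record interleaves the position with its genotypes,
# and consecutive records are compressed together in blocks.
RECORDS_PER_BLOCK = 100
row_blocks = []
for start in range(0, NUM_VARIANTS, RECORDS_PER_BLOCK):
    stop = min(start + RECORDS_PER_BLOCK, NUM_VARIANTS)
    raw = b"".join(
        struct.pack("<i", positions[j]) + genotypes[j] for j in range(start, stop)
    )
    row_blocks.append(zlib.compress(raw))

# Columnar layout: each field is compressed and stored on its own.
columns = {
    "position": zlib.compress(struct.pack(f"<{NUM_VARIANTS}i", *positions)),
    "genotype": zlib.compress(b"".join(genotypes)),
}


def positions_from_rows(blocks):
    """Decompress every block, then pick the position out of each record."""
    record_size = 4 + NUM_SAMPLES
    out, bytes_decompressed = [], 0
    for block in blocks:
        raw = zlib.decompress(block)
        bytes_decompressed += len(raw)
        for offset in range(0, len(raw), record_size):
            out.append(struct.unpack_from("<i", raw, offset)[0])
    return out, bytes_decompressed


def positions_from_columns(cols):
    """Decompress only the position column; genotypes are never touched."""
    raw = zlib.decompress(cols["position"])
    return list(struct.unpack(f"<{len(raw) // 4}i", raw)), len(raw)


row_positions, row_bytes = positions_from_rows(row_blocks)
col_positions, col_bytes = positions_from_columns(columns)
print(f"row-wise: {row_bytes} bytes decompressed; columnar: {col_bytes} bytes")
```

Both functions return the same positions, but the row-wise path has to decompress every genotype byte along the way, and that gap grows linearly with the number of samples.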

Suggested citation:


@article{Sherry2001dbSNP,
    author = {Sherry, S. T. and Ward, M.-H. and Kholodov, M. and Baker, J. and Phan, L. and Smigielski, E. M. and Sirotkin, K.},
    title = "{dbSNP: the NCBI database of genetic variation}",
    journal = {Nucleic Acids Research},
    volume = {29},
    number = {1},
    pages = {308--311},
    year = {2001},
    month = {01},
    abstract = "{In response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology, the National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T.Sherry, M.Ward and K.Sirotkin (1999) Genome Res., 9, 677–679]. Submissions to dbSNP will be integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. The complete contents of dbSNP are available to the public at website: http://www.ncbi.nlm.nih.gov/SNP. The complete contents of dbSNP can also be downloaded in multiple formats via anonymous FTP at ftp://ncbi.nlm.nih.gov/snp/.}",
    issn = {0305-1048},
    doi = {10.1093/nar/29.1.308},
    url = {https://doi.org/10.1093/nar/29.1.308},
    eprint = {https://academic.oup.com/nar/article-pdf/29/1/308/9905801/290308.pdf},
}

Maybe in the discussion, we could mention something along these lines:

We foresee that the columnar format may speed up common quality control and annotation 
pipelines by orders of magnitude. One common workflow that may substantially benefit from 
this is performing liftover between genome builds, currently a slow and memory-intensive 
process that can potentially be streamlined with the columnar format proposed here.
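
As a rough sketch of why rewriting a per-variant field can be cheap in a columnar layout, the toy example below keeps each field as a separately compressed blob and rewrites only the position column. The store is a plain dictionary, not a real Zarr store, and the field names, `rewrite_positions`, and the `+ 42` transform are hypothetical placeholders. A real liftover maps coordinates through chain files and may also drop, reorder, or strand-flip variants, none of which is modelled here.

```python
import struct
import zlib

# Toy columnar store (hypothetical): one independently compressed blob per field.
positions = [1_000, 2_500, 7_300]
store = {
    "variant_position": zlib.compress(struct.pack("<3i", *positions)),
    "call_genotype": zlib.compress(bytes([0, 1, 1, 0, 0, 0] * 1_000)),
}
genotype_blob_before = store["call_genotype"]


def rewrite_positions(store, transform):
    """Decode, transform, and re-encode the position column only."""
    raw = zlib.decompress(store["variant_position"])
    old = struct.unpack(f"<{len(raw) // 4}i", raw)
    new = [transform(p) for p in old]
    store["variant_position"] = zlib.compress(struct.pack(f"<{len(new)}i", *new))
    return new


# Placeholder coordinate change standing in for a real liftover mapping.
new_positions = rewrite_positions(store, lambda p: p + 42)

# The genotype blob is the very same object: it was never read or rewritten.
print(new_positions, store["call_genotype"] is genotype_blob_before)
```

The work done scales with the number of variants only, whereas a row-wise file would need every record, genotypes included, to be decompressed and written back out.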