sgkit-dev / sgkit

Scalable genetics toolkit
https://sgkit-dev.github.io/sgkit
Apache License 2.0
217 stars, 32 forks

Genome accessibility/callability #1219

Open mufernando opened 1 month ago

mufernando commented 1 month ago

It is important to consider genome accessibility when computing rates from genomic data.

scikit-allel has options to include an "accessibility mask", a boolean array indicating whether each base is accessible, which is then used to properly normalize per-base quantities.
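To make the normalization concrete, here is a rough sketch in plain NumPy (not scikit-allel's or sgkit's API). The contig length, windows, and raw counts are made-up toy values; the point is only that the denominator is the number of accessible bases in each window rather than the window width.

```python
import numpy as np

# Hypothetical toy data: a 20 bp contig where positions 5-9 are inaccessible.
is_accessible = np.ones(20, dtype=bool)
is_accessible[5:10] = False

# Hypothetical windows (half-open [start, stop)) and raw per-window counts,
# e.g. a summed number of pairwise differences.
windows = np.array([[0, 10], [10, 20]])
raw_counts = np.array([2.0, 2.0])

# Number of accessible bases falling in each window.
n_accessible = np.array(
    [is_accessible[start:stop].sum() for start, stop in windows]
)

# Normalize by accessible bases, not by window width.
# Windows with no accessible bases are left as NaN.
per_base = np.full(len(windows), np.nan)
np.divide(raw_counts, n_accessible, out=per_base, where=n_accessible > 0)
```

Both windows are 10 bp wide and have the same raw count, but the first has only 5 accessible bases, so its per-base value comes out twice as large as the second's.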

I found mentions of implementing this in #341.

I am happy to help make this happen, but since I am new to the codebase I'd need some hand-holding... Ideally we would have a way of reading BED files so that the accessible intervals can be attached to the genotype dataset. Then, when computing per-base statistics, we would intersect the accessible intervals with the window intervals to get the right denominator.
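Roughly, the two steps could look like the sketch below. This is only an illustration in plain Python/NumPy: `read_bed` and `accessible_bases_per_window` are hypothetical helper names, not existing sgkit functions, and the code assumes a single contig with non-overlapping BED intervals (overlapping intervals would be double counted). BED coordinates are 0-based and half-open, and the windows here use the same convention.

```python
import io
import numpy as np


def read_bed(handle, contig):
    """Hypothetical helper: collect (start, end) for one contig from BED text."""
    starts, ends = [], []
    for line in handle:
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        if chrom == contig:
            starts.append(int(start))
            ends.append(int(end))
    return np.array(starts, dtype=np.int64), np.array(ends, dtype=np.int64)


def accessible_bases_per_window(bed_starts, bed_ends, win_starts, win_ends):
    """Hypothetical helper: accessible base count in each half-open window."""
    # Broadcast to a (n_windows, n_intervals) grid of pairwise overlaps.
    lo = np.maximum(bed_starts[None, :], win_starts[:, None])
    hi = np.minimum(bed_ends[None, :], win_ends[:, None])
    # Negative lengths mean the interval and window do not overlap.
    return np.clip(hi - lo, 0, None).sum(axis=1)


# Toy BED content: chr1 is accessible on [0, 5) and [10, 20).
bed_text = "chr1\t0\t5\nchr1\t10\t20\nchr2\t0\t100\n"
bed_starts, bed_ends = read_bed(io.StringIO(bed_text), "chr1")

win_starts = np.array([0, 10, 20])
win_ends = np.array([10, 20, 30])
denominators = accessible_bases_per_window(
    bed_starts, bed_ends, win_starts, win_ends
)
```

The pairwise grid is quadratic in memory, which is fine for a toy example but would need a sorted sweep or `np.searchsorted` on cumulative interval lengths for whole-genome masks.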

jeromekelleher commented 1 month ago

Sounds like adding a bed2zarr command to vcf2zarr would be a great starting point - fancy taking it on?