snakemake-workflows / dna-seq-varlociraptor

A Snakemake workflow for calling small and structural variants under any kind of scenario (tumor/normal, tumor/normal/relapse, germline, pedigree, populations) via the unified statistical model of Varlociraptor.
MIT License

feat: add binned vaf column for sorting by allele frequency #240

Closed FelixMoelder closed 1 year ago

FelixMoelder commented 1 year ago

Workflows like dna-seq-mtb come with several callsets, each split by low and high allele frequency, so the final datavzrd report becomes cluttered. Instead of creating two separate callsets, we could create a single one by adding binned allele frequency columns (binned into low, medium and high). This allows sorting variants by their AF, showing variants with a high frequency at the top of the report.

While this PR is just a preparation and the sorting still needs to be defined in the callset configuration, I would like to discuss whether this implementation can be improved. Currently we have a distinct allele frequency for each sample (e.g. tumor and normal), resulting in a corresponding binned AF column per sample. As we use predefined callsets in the dna-seq-mtb workflow, we also need to set the column name to sort by in the default config file, which might be something like tumor: binned vaf. In this case we would assume that a sample called tumor always exists, which might not be the case.

johanneskoester commented 1 year ago

Let us have just one column (called binned_max_vaf), which first takes the max VAF among all samples in a group and then bins it.
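
A minimal sketch of that idea in plain Python, not the code from this PR. The bin edges (0.1 and 0.5), the helper names bin_vaf and binned_max_vaf, and the example records are all assumptions made for illustration; the thread does not state the actual thresholds or data layout.

```python
# Hypothetical bin edges; the PR discussion does not specify the real ones.
LOW_UPPER = 0.1
MEDIUM_UPPER = 0.5


def bin_vaf(vaf):
    """Map a single allele frequency to 'low', 'medium' or 'high'."""
    if vaf is None:
        return None
    if vaf < LOW_UPPER:
        return "low"
    if vaf < MEDIUM_UPPER:
        return "medium"
    return "high"


def binned_max_vaf(sample_vafs):
    """Take the max VAF among all samples of a group, then bin it.

    sample_vafs maps sample name -> VAF (or None if missing), so no
    particular sample name such as 'tumor' has to exist.
    """
    observed = [v for v in sample_vafs.values() if v is not None]
    if not observed:
        return None
    return bin_vaf(max(observed))


# Illustrative records: one dict of per-sample VAFs per variant.
calls = [
    {"tumor": 0.04, "normal": 0.07},
    {"tumor": 0.62, "normal": 0.01},
    {"tumor": 0.30, "normal": None},
]

# Sort so that variants with a high binned max VAF come first.
BIN_ORDER = {"high": 0, "medium": 1, "low": 2}
ranked = sorted(calls, key=lambda c: BIN_ORDER[binned_max_vaf(c)])
```

Because the bin is derived from the group-wide maximum, the sort column name stays the same regardless of how the samples in a group are named.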

FelixMoelder commented 1 year ago

This should be good to go now.