Enhancement request - granular HTML output

kvn95ss commented 2 years ago

Hello!

We have a few enhancement requests to make. Mainly concerning the outputs -

HTML output to include variants of interest
HTML output with population filtered variants

As it stands, the HTML output provides details only for the top N results. If I have to see the HTML output for any variant lower on the list, I'd have to run Squirls only for those entries.

The HTML output is sorted solely by scores, it would be helpful if it incorporated population data or even custom data sets by way of allele frequency or allele counts. This should make filtering and prioritization quicker as we'd be focusing on rare + splice affecting variants.

ielis commented 2 years ago

Hi @kvn95ss , I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.

To see output for variants lower on the list, you would have to let N to be a large number. However, large N leads to large HTML reports which are sluggish and hard to work with. I am not sure anything else can really be done here. Would it be helpful to have 2 options for sorting the variants in the HTML: a) sort the variants on coordinates (chr1 first, then chr2 etc.), or b) sort by the scores?

kvn95ss commented 2 years ago

Hello,

I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.

If this is the case, then sorting just based on scores might make sense. I guess an overview of our current workflow might let you understand where we are coming from -

Multiple VCFs consolidated using (GenomicsDB and genotypegvcfs module from GATK) to create a master VCF file with all samples
Samples of interest extracted from master vcf
VCF annotated with annovar, output as TSV
Filtered based on population (gnomad, exac, etc)
OMIM and HPO integration (Custom annotations, etc)
Converting TSV to Excel for final analysis.

I know there is CSV support in SQUIRLS, but I saw your example output and that's not how annovar outputs variants. For illustration I'm using the example CSV -

CHROM	ID	POS	REF	ALT
chr9	SURF2_donor_4bp_downstream_exon3	136224694	A	T
chr3	PBRM1_acceptor_5bp_upstream_exon	52676065	CA	C
chr3	BCHE_acceptor_8bp_upstream	165504107	A	C
chr17	BRCA1_frameshift_deletion_exon21	41199691	TGGC	T

After annovar annotation will look like -

CHROM	ID	POS	REF	ALT
chr9	SURF2_donor_4bp_downstream_exon3	136224694	A	T
chr3	PBRM1_acceptor_5bp_upstream_exon_DELETION	52676065	A	-
chr3	BCHE_acceptor_8bp_upstream	165504107	A	C
chr17	BRCA1_frameshift_deletion_exon21_DELETION	41199691	GGC	-
chr17	BRCA1_frameshift_deletion_exon21_INSERTION	41199691	-	GGC

If squirls can support this type of input, that would also be fine by us.

In the other issue we raised - #46 we wanted to use the main vcf (from our step 1) as input, calculate the squirls score then use annovar (in our step 3) to incorporate the scores. However, if the annovar output is supported by squirls we can run it in our step 5.

ielis commented 2 years ago

Hi @kvn95ss , This sounds like an application-specific issue and it should be possible to resolve it with preprocessing script in your pipeline. A Python script like the one below should work:

#!/usr/bin/env python3

import csv
import sys
import unittest

import pysam

def main(fpath_fasta: str, fpath_in: str, fpath_out: str) -> int:
    with pysam.FastaFile(fpath_fasta) as fasta, open(fpath_in) as fh_in, open(fpath_out, 'w') as fh_out:
        reader = csv.DictReader(fh_in)

        writer = csv.DictWriter(fh_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            processed = process_row(fasta, row)
            writer.writerow(processed)

    return 0

def process_row(fasta: pysam.FastaFile, row: dict) -> dict:
    ref = row['REF']
    alt = row['ALT']
    if ref == '-' or alt == '-':
        chrom = row['CHROM']
        pos = int(row['POS'])
        # we need to fetch the upstream base
        pos -= 1
        upstream_base = fasta.fetch(reference=chrom, start=pos-1, end=pos)
        if ref == '-':
            # INSERTION
            row['REF'] = upstream_base
            row['ALT'] = upstream_base + alt
        else:
            # DELETION
            row['REF'] = upstream_base + ref
            row['ALT'] = upstream_base
        row['POS'] = str(pos)
    else:
        # SNV or MNV
        pass
    return row

if __name__ == '__main__':
    argv = sys.argv
    sys.exit(main(argv[1], argv[2], argv[3]))

class PreprocessTest(unittest.TestCase):

    def setUp(self) -> None:
        test_fasta = 'path/to/your/genome.fa'  # TODO - edit to make the tests pass
        self.fasta = pysam.FastaFile(test_fasta)

    def tearDown(self) -> None:
        self.fasta.close()

    def test_process_row_snp(self):
        row = {'CHROM': 'chr9', 'ID': 'SURF2_donor_4bp_downstream_exon3',
               'POS': '136224694', 'REF': 'A', 'ALT': 'T'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '136224694')
        self.assertEqual(novel['REF'], 'A')
        self.assertEqual(novel['ALT'], 'T')

    def test_process_row_ins(self):
        row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_INSERTION',
               'POS': '41199692', 'REF': '-', 'ALT': 'GGC'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '41199691')
        self.assertEqual(novel['REF'], 'T')
        self.assertEqual(novel['ALT'], 'TGGC')

    def test_process_row_del(self):
        row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_DELETION',
               'POS': '41199692', 'REF': 'GGC', 'ALT': '-'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '41199691')
        self.assertEqual(novel['REF'], 'TGGC')
        self.assertEqual(novel['ALT'], 'T')

You can run the script as:

python3 preprocess.py path/to/genome.fa annovar_output.csv, updated.csv

where:

path/to/genome.fa points to indexed genome FASTA file
annovar_output.csv is the output of step 3 of your pipeline
updated.csv will be a CSV file for SQUIRLS

kvn95ss commented 2 years ago

I saw the script just now, I am really grateful for it! I'll use it with our test data and see how it goes!

monarch-initiative / Squirls

Enhancement request - granular HTML output #45