monarch-initiative / Squirls

Interpretable prioritization of splice variants in diagnostic next-generation sequencing
https://squirls.readthedocs.io/en/master
Other
15 stars 4 forks source link

Enhancement request - granular HTML output #45

Open kvn95ss opened 2 years ago

kvn95ss commented 2 years ago

Hello!

We have a few enhancement requests to make. Mainly concerning the outputs -

As it stands, the HTML output provides details only for the top N results. If I have to see the HTML output for any variant lower on the list, I'd have to run Squirls only for those entries.

The HTML output is sorted solely by scores, it would be helpful if it incorporated population data or even custom data sets by way of allele frequency or allele counts. This should make filtering and prioritization quicker as we'd be focusing on rare + splice affecting variants.

ielis commented 2 years ago

Hi @kvn95ss , I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.

To see output for variants lower on the list, you would have to let N to be a large number. However, large N leads to large HTML reports which are sluggish and hard to work with. I am not sure anything else can really be done here. Would it be helpful to have 2 options for sorting the variants in the HTML: a) sort the variants on coordinates (chr1 first, then chr2 etc.), or b) sort by the scores?

kvn95ss commented 2 years ago

Hello,

I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.

If this is the case, then sorting just based on scores might make sense. I guess an overview of our current workflow might let you understand where we are coming from -

  1. Multiple VCFs consolidated using (GenomicsDB and genotypegvcfs module from GATK) to create a master VCF file with all samples
  2. Samples of interest extracted from master vcf
  3. VCF annotated with annovar, output as TSV
  4. Filtered based on population (gnomad, exac, etc)
  5. OMIM and HPO integration (Custom annotations, etc)
  6. Converting TSV to Excel for final analysis.

I know there is CSV support in SQUIRLS, but I saw your example output and that's not how annovar outputs variants. For illustration I'm using the example CSV -

CHROM ID POS REF ALT
chr9 SURF2_donor_4bp_downstream_exon3 136224694 A T
chr3 PBRM1_acceptor_5bp_upstream_exon 52676065 CA C
chr3 BCHE_acceptor_8bp_upstream 165504107 A C
chr17 BRCA1_frameshift_deletion_exon21 41199691 TGGC T

After annovar annotation will look like -

CHROM ID POS REF ALT
chr9 SURF2_donor_4bp_downstream_exon3 136224694 A T
chr3 PBRM1_acceptor_5bp_upstream_exon_DELETION 52676065 A -
chr3 BCHE_acceptor_8bp_upstream 165504107 A C
chr17 BRCA1_frameshift_deletion_exon21_DELETION 41199691 GGC -
chr17 BRCA1_frameshift_deletion_exon21_INSERTION 41199691 - GGC

If squirls can support this type of input, that would also be fine by us.

In the other issue we raised - #46 we wanted to use the main vcf (from our step 1) as input, calculate the squirls score then use annovar (in our step 3) to incorporate the scores. However, if the annovar output is supported by squirls we can run it in our step 5.

ielis commented 2 years ago

Hi @kvn95ss , This sounds like an application-specific issue and it should be possible to resolve it with preprocessing script in your pipeline. A Python script like the one below should work:

#!/usr/bin/env python3

import csv
import sys
import unittest

import pysam

def main(fpath_fasta: str, fpath_in: str, fpath_out: str) -> int:
    with pysam.FastaFile(fpath_fasta) as fasta, open(fpath_in) as fh_in, open(fpath_out, 'w') as fh_out:
        reader = csv.DictReader(fh_in)

        writer = csv.DictWriter(fh_out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            processed = process_row(fasta, row)
            writer.writerow(processed)

    return 0

def process_row(fasta: pysam.FastaFile, row: dict) -> dict:
    ref = row['REF']
    alt = row['ALT']
    if ref == '-' or alt == '-':
        chrom = row['CHROM']
        pos = int(row['POS'])
        # we need to fetch the upstream base
        pos -= 1
        upstream_base = fasta.fetch(reference=chrom, start=pos-1, end=pos)
        if ref == '-':
            # INSERTION
            row['REF'] = upstream_base
            row['ALT'] = upstream_base + alt
        else:
            # DELETION
            row['REF'] = upstream_base + ref
            row['ALT'] = upstream_base
        row['POS'] = str(pos)
    else:
        # SNV or MNV
        pass
    return row

if __name__ == '__main__':
    argv = sys.argv
    sys.exit(main(argv[1], argv[2], argv[3]))

class PreprocessTest(unittest.TestCase):

    def setUp(self) -> None:
        test_fasta = 'path/to/your/genome.fa'  # TODO - edit to make the tests pass
        self.fasta = pysam.FastaFile(test_fasta)

    def tearDown(self) -> None:
        self.fasta.close()

    def test_process_row_snp(self):
        row = {'CHROM': 'chr9', 'ID': 'SURF2_donor_4bp_downstream_exon3',
               'POS': '136224694', 'REF': 'A', 'ALT': 'T'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '136224694')
        self.assertEqual(novel['REF'], 'A')
        self.assertEqual(novel['ALT'], 'T')

    def test_process_row_ins(self):
        row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_INSERTION',
               'POS': '41199692', 'REF': '-', 'ALT': 'GGC'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '41199691')
        self.assertEqual(novel['REF'], 'T')
        self.assertEqual(novel['ALT'], 'TGGC')

    def test_process_row_del(self):
        row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_DELETION',
               'POS': '41199692', 'REF': 'GGC', 'ALT': '-'}
        novel = process_row(self.fasta, row)
        self.assertEqual(novel['POS'], '41199691')
        self.assertEqual(novel['REF'], 'TGGC')
        self.assertEqual(novel['ALT'], 'T')

You can run the script as:

python3 preprocess.py path/to/genome.fa annovar_output.csv, updated.csv

where:

kvn95ss commented 2 years ago

I saw the script just now, I am really grateful for it! I'll use it with our test data and see how it goes!