Open kvn95ss opened 2 years ago
Hi @kvn95ss , I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.
To see output for variants lower on the list, you would have to let N to be a large number. However, large N leads to large HTML reports which are sluggish and hard to work with. I am not sure anything else can really be done here. Would it be helpful to have 2 options for sorting the variants in the HTML: a) sort the variants on coordinates (chr1
first, then chr2
etc.), or b) sort by the scores?
Hello,
I'm sorry but custom datasets and population data are out of scope for SQUIRLS. This type of filtering should be done in a pipeline prior running SQUIRLS.
If this is the case, then sorting just based on scores might make sense. I guess an overview of our current workflow might let you understand where we are coming from -
I know there is CSV support in SQUIRLS, but I saw your example output and that's not how annovar outputs variants. For illustration I'm using the example CSV -
CHROM | ID | POS | REF | ALT |
---|---|---|---|---|
chr9 | SURF2_donor_4bp_downstream_exon3 | 136224694 | A | T |
chr3 | PBRM1_acceptor_5bp_upstream_exon | 52676065 | CA | C |
chr3 | BCHE_acceptor_8bp_upstream | 165504107 | A | C |
chr17 | BRCA1_frameshift_deletion_exon21 | 41199691 | TGGC | T |
After annovar annotation will look like -
CHROM | ID | POS | REF | ALT |
---|---|---|---|---|
chr9 | SURF2_donor_4bp_downstream_exon3 | 136224694 | A | T |
chr3 | PBRM1_acceptor_5bp_upstream_exon_DELETION | 52676065 | A | - |
chr3 | BCHE_acceptor_8bp_upstream | 165504107 | A | C |
chr17 | BRCA1_frameshift_deletion_exon21_DELETION | 41199691 | GGC | - |
chr17 | BRCA1_frameshift_deletion_exon21_INSERTION | 41199691 | - | GGC |
If squirls can support this type of input, that would also be fine by us.
In the other issue we raised - #46 we wanted to use the main vcf (from our step 1) as input, calculate the squirls score then use annovar (in our step 3) to incorporate the scores. However, if the annovar output is supported by squirls we can run it in our step 5.
Hi @kvn95ss , This sounds like an application-specific issue and it should be possible to resolve it with preprocessing script in your pipeline. A Python script like the one below should work:
#!/usr/bin/env python3
import csv
import sys
import unittest
import pysam
def main(fpath_fasta: str, fpath_in: str, fpath_out: str) -> int:
with pysam.FastaFile(fpath_fasta) as fasta, open(fpath_in) as fh_in, open(fpath_out, 'w') as fh_out:
reader = csv.DictReader(fh_in)
writer = csv.DictWriter(fh_out, fieldnames=reader.fieldnames)
writer.writeheader()
for row in reader:
processed = process_row(fasta, row)
writer.writerow(processed)
return 0
def process_row(fasta: pysam.FastaFile, row: dict) -> dict:
ref = row['REF']
alt = row['ALT']
if ref == '-' or alt == '-':
chrom = row['CHROM']
pos = int(row['POS'])
# we need to fetch the upstream base
pos -= 1
upstream_base = fasta.fetch(reference=chrom, start=pos-1, end=pos)
if ref == '-':
# INSERTION
row['REF'] = upstream_base
row['ALT'] = upstream_base + alt
else:
# DELETION
row['REF'] = upstream_base + ref
row['ALT'] = upstream_base
row['POS'] = str(pos)
else:
# SNV or MNV
pass
return row
if __name__ == '__main__':
argv = sys.argv
sys.exit(main(argv[1], argv[2], argv[3]))
class PreprocessTest(unittest.TestCase):
def setUp(self) -> None:
test_fasta = 'path/to/your/genome.fa' # TODO - edit to make the tests pass
self.fasta = pysam.FastaFile(test_fasta)
def tearDown(self) -> None:
self.fasta.close()
def test_process_row_snp(self):
row = {'CHROM': 'chr9', 'ID': 'SURF2_donor_4bp_downstream_exon3',
'POS': '136224694', 'REF': 'A', 'ALT': 'T'}
novel = process_row(self.fasta, row)
self.assertEqual(novel['POS'], '136224694')
self.assertEqual(novel['REF'], 'A')
self.assertEqual(novel['ALT'], 'T')
def test_process_row_ins(self):
row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_INSERTION',
'POS': '41199692', 'REF': '-', 'ALT': 'GGC'}
novel = process_row(self.fasta, row)
self.assertEqual(novel['POS'], '41199691')
self.assertEqual(novel['REF'], 'T')
self.assertEqual(novel['ALT'], 'TGGC')
def test_process_row_del(self):
row = {'CHROM': 'chr17', 'ID': 'BRCA1_frameshift_deletion_exon21_DELETION',
'POS': '41199692', 'REF': 'GGC', 'ALT': '-'}
novel = process_row(self.fasta, row)
self.assertEqual(novel['POS'], '41199691')
self.assertEqual(novel['REF'], 'TGGC')
self.assertEqual(novel['ALT'], 'T')
You can run the script as:
python3 preprocess.py path/to/genome.fa annovar_output.csv, updated.csv
where:
path/to/genome.fa
points to indexed genome FASTA fileannovar_output.csv
is the output of step 3 of your pipelineupdated.csv
will be a CSV file for SQUIRLSI saw the script just now, I am really grateful for it! I'll use it with our test data and see how it goes!
Hello!
We have a few enhancement requests to make. Mainly concerning the outputs -
As it stands, the HTML output provides details only for the top N results. If I have to see the HTML output for any variant lower on the list, I'd have to run Squirls only for those entries.
The HTML output is sorted solely by scores, it would be helpful if it incorporated population data or even custom data sets by way of allele frequency or allele counts. This should make filtering and prioritization quicker as we'd be focusing on rare + splice affecting variants.