populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

So.. STRs? #396

Open MattWellie opened 2 weeks ago

MattWellie commented 2 weeks ago

STRs as a new variant type to include in analysis.

We currently make use of small variants and SVs (see #372... 😞), and CPG already runs STRipy reports on all samples. We can use those reports as a source of input data, pending any teething problems with incorporating STR logic in the MOI algorithms...

  1. How much do we trust STRipy for biallelic calls, i.e. calling each allele reliably instead of dumping all the repeats on one allele? Biallelic calls are crucial for the MOI tests we currently have in place.
  2. What's our source for pathogenic thresholds in STRs? It looks like STRipy self-assesses. Do we need to update this periodically, or is there another source we should use for confirmation?
  3. We have JSON files for each individual sample. Can we combine those into some kind of joint-call representation to make the ingestion interface simpler?
  4. Someone who knows STRs better can flesh this out with likely issues I haven't thought of.
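
A rough sketch of what point 3 could look like, pivoting per-sample reports into one locus-by-sample table. The input shape used here (`sample`, `loci`, `locus`, `alleles`) is a made-up stand-in and not the real STRipy JSON schema, and `load_report` / `merge_reports` are hypothetical helper names.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_report(path):
    """Read one per-sample JSON report from disk."""
    with Path(path).open() as handle:
        return json.load(handle)


def merge_reports(reports):
    """Pivot per-sample reports into {locus: {sample_id: [repeat counts]}}.

    Assumed (hypothetical) input shape, NOT the real STRipy schema:
        {"sample": "S1", "loci": [{"locus": "HTT", "alleles": [17, 45]}]}
    """
    joint = defaultdict(dict)
    for report in reports:
        sample_id = report["sample"]
        for call in report["loci"]:
            # Sort so the shorter allele always comes first.
            joint[call["locus"]][sample_id] = sorted(call["alleles"])
    return dict(joint)


# Toy in-memory reports; real use would map load_report over the JSON paths.
reports = [
    {"sample": "S1", "loci": [{"locus": "HTT", "alleles": [45, 17]}]},
    {
        "sample": "S2",
        "loci": [
            {"locus": "HTT", "alleles": [19, 18]},
            {"locus": "FMR1", "alleles": [30]},
        ],
    },
]
joint = merge_reports(reports)
```

Keeping a list of repeat counts per sample means hemizygous or single-allele calls can sit alongside biallelic ones, which matters if the MOI tests need to tell them apart.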

ChiaraF32 commented 2 weeks ago

Hi Matt,

  1. I am not entirely sure about how accurate the bi-allelic calls are, but you could chat to Andreas Halman (who created STRipy). He's really helpful with any queries you have about the tool.
  2. You can also use STR archive, which was created by Harriet Dashnow.
  3. Hmm, not too sure about this, but I do know that as part of ExpansionHunter Denovo's workflow, they create a "multisample" profile which does merge JSON files across samples. Unsure if this is useful.

Cheers, Chiara

SamBryen commented 2 weeks ago

  1. I'm not sure either, and as far as I know we don't have many (any?) positive controls at CPG to get a good feel for this. I suspect that it will vary for different loci too. There are some repeats that are overcalled all the time and don't look real, and I think we'd want to blacklist some of those.
  2. You could run a check periodically between the STRipy database, the STR archive Chiara mentioned and the current version we are using in our pipeline to assess when we need to change what STRipy calls in our pipeline. This will be trickier to automate because STRipy will need to be rerun for any new loci that appear in these databases. There are also a few bespoke loci that we've added that aren't in those databases. As for pathogenic thresholds, I wouldn't expect these to change much over time, but I could be wrong here.
  3. A joint call would be really useful. I think it would massively help in excluding the noisy artifactual calls that turn up all the time in our callset.
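
The periodic check in point 2 boils down to a set comparison, sketched below. The locus names and the `compare_locus_sets` helper are hypothetical, and how the locus lists are actually fetched from each database is left out because that depends on interfaces not described here.

```python
def compare_locus_sets(pipeline_loci, stripy_loci, archive_loci, bespoke_loci=frozenset()):
    """Report loci that differ between the pipeline and the upstream databases."""
    known_upstream = set(stripy_loci) | set(archive_loci)
    pipeline = set(pipeline_loci)
    return {
        # Present upstream but not yet called by the pipeline: would need a rerun.
        "new_upstream": sorted(known_upstream - pipeline),
        # Called by the pipeline, absent upstream, and not a known bespoke locus.
        "only_in_pipeline": sorted(pipeline - known_upstream - set(bespoke_loci)),
    }


# Toy locus names, purely illustrative.
pipeline = {"LOCUS_A", "LOCUS_B", "LOCUS_CUSTOM", "LOCUS_OLD"}
stripy = {"LOCUS_A", "LOCUS_B", "LOCUS_C"}
archive = {"LOCUS_A", "LOCUS_D"}
bespoke = {"LOCUS_CUSTOM"}

diff = compare_locus_sets(pipeline, stripy, archive, bespoke)
```

A non-empty `new_upstream` list is the signal that samples would need rerunning, while the bespoke set keeps locally added loci from being reported as stale.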

In general, there aren't a huge number of 'real' looking STR expansion calls in these reports. If you can mostly filter by what is rare in our callset then I don't think you'll end up with too much noise if using the pathogenic cut-offs (or even just outliers) in those databases. Relying too heavily on MOI may not be as helpful, especially for large STRs that are difficult to detect accurately in short-read data.
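
A minimal sketch of the "rare in our callset plus cut-off" filter, assuming a joint table shaped as `{locus: {sample: [repeat counts]}}` with at least one count per sample. The thresholds, locus names, and the `max_carrier_fraction` parameter are invented for illustration and are not real pathogenic cut-offs.

```python
def rare_expansions(joint, thresholds, max_carrier_fraction=0.05):
    """Return (locus, sample, longest_allele) for expansions that are rare in the callset."""
    hits = []
    for locus, calls in joint.items():
        cutoff = thresholds.get(locus)
        if cutoff is None or not calls:
            continue  # no cut-off known for this locus, or no calls at all
        carriers = [s for s, alleles in calls.items() if max(alleles) >= cutoff]
        # A locus where many samples exceed the cut-off is treated as noise.
        if len(carriers) / len(calls) <= max_carrier_fraction:
            hits.extend((locus, s, max(calls[s])) for s in sorted(carriers))
    return hits


# Toy callset: LOCUS_A has one expanded sample, LOCUS_B is expanded in half of them.
joint = {
    "LOCUS_A": {f"S{i}": [10, 12] for i in range(20)},
    "LOCUS_B": {f"S{i}": [10, 60] if i % 2 else [10, 11] for i in range(20)},
}
joint["LOCUS_A"]["S3"] = [10, 50]
thresholds = {"LOCUS_A": 40, "LOCUS_B": 40}  # made-up cut-offs

hits = rare_expansions(joint, thresholds, max_carrier_fraction=0.1)
```

Here the filter only looks at the longest allele per sample, so it does not depend on the two alleles being split correctly.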

cassimons commented 2 weeks ago

> In general, there aren't a huge number of 'real' looking STR expansion calls in these reports. If you can mostly filter by what is rare in our callset then I don't think you'll end up with too much noise if using the pathogenic cut-offs (or even just outliers) in those databases.

+1 on this. STRipy reports very few variants and it already has good logic in place for flagging potentially pathogenic expansions that need review. The only issue seems to be ~2 loci that regularly turn up artefact calls. I have chatted with Andre about how to tune these better, but for the moment my feeling is just to hard blacklist these loci.

All we need to do is pull the flagged variants out of the JSON and pass them directly as a category. The biggest pain is probably going to be transforming them into a pseudo variant call we can inject into seqr so we can represent them there. Standalone reports will be fine, but it will just take some time to work through in seqr.
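
As a rough sketch of that flow: skip blacklisted loci, keep only flagged calls, and emit a VCF-like pseudo variant per call. Every field name here (`flagged`, `chrom`, `start`, `motif`, `alleles`), the blacklist entry, and the `<STR:...>` placeholder allele are assumptions for illustration. None of it reflects the real STRipy output or what seqr actually accepts.

```python
BLACKLIST = {"LOCUS_NOISY"}  # hypothetical hard-blacklisted locus


def flagged_to_pseudo_variants(report, blacklist=BLACKLIST):
    """Turn flagged, non-blacklisted STR calls into VCF-like pseudo variant records."""
    records = []
    for call in report["loci"]:
        if not call.get("flagged") or call["locus"] in blacklist:
            continue
        records.append(
            {
                "sample": report["sample"],
                "chrom": call["chrom"],
                "pos": call["start"],
                "ref": "N",  # placeholder reference base
                "alt": f"<STR:{call['locus']}>",  # placeholder symbolic allele
                "info": {"repeat_unit": call["motif"], "alleles": call["alleles"]},
                "category": "str",
            }
        )
    return records


# Toy report with made-up coordinates and loci.
report = {
    "sample": "S1",
    "loci": [
        {"locus": "LOCUS_A", "chrom": "chrT", "start": 1000, "motif": "CAG",
         "alleles": [17, 45], "flagged": True},
        {"locus": "LOCUS_NOISY", "chrom": "chrT", "start": 2000, "motif": "GCC",
         "alleles": [30, 90], "flagged": True},
        {"locus": "LOCUS_B", "chrom": "chrT", "start": 3000, "motif": "CTG",
         "alleles": [12, 13], "flagged": False},
    ],
}
records = flagged_to_pseudo_variants(report)
```

The same records could feed the standalone reports directly, leaving only the seqr representation to work out.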