populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Noise Reduction - Polybase tracts #12

Open MattWellie opened 2 years ago

MattWellie commented 2 years ago

A number of the variants that survive the filtering and annotation process are likely to be noise, despite quality filtering and VQSR.

A key category of variants that can be presumptively excluded is those occurring in poly-base tracts. This issue should serve as a discussion board for how to refine and implement this feature.


lgruen commented 2 years ago

Is the broader question maybe how we want to handle microsatellites / STRs? E.g. is it worth including a relevant variant catalog? @hopedisastro might be interested in this as well.

hopedisastro commented 2 years ago

If I'm understanding this correctly, the issue is determining the importance of in-repeat repeats, i.e. repeats found within a larger repetitive sequence. I think for the most part these repeats are classified as de novo, so only de novo callers like EHDn and STRling would pick them up without needing a variant catalog. I don't think there's a systematic approach to predicting their functional significance at the moment, as each variant is reviewed on a case-by-case basis. Something else to keep in mind is that relatively few in-repeat repeats are called by the two de novo callers mentioned above (I think for STRling they mentioned around 40 per genome). So if you did choose to retain this variant class, it wouldn't result in too many excess flags or potential artefacts.

hopedisastro commented 2 years ago

Also, I would consider the noise:signal ratio across different motif lengths. I would expect more noise in homopolymer and dinucleotide runs than in longer motifs, and this may influence your approach to flagging.
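To make the motif-length idea concrete, here is a minimal sketch of classifying a repeat tract by its smallest repeating unit. The function names and the three labels are hypothetical and chosen only for illustration; nothing here is taken from the pipeline, and the expectation that shorter motifs are noisier is an expectation rather than something this code measures.

```python
def smallest_repeat_unit(seq: str) -> str:
    """Return the shortest unit that, repeated, reproduces seq exactly."""
    seq = seq.upper()
    n = len(seq)
    for size in range(1, n + 1):
        # Only unit sizes that divide the tract length can tile it exactly.
        if n % size == 0 and seq[:size] * (n // size) == seq:
            return seq[:size]
    return seq  # only reached for an empty string


def describe_tract(seq: str) -> tuple[str, int]:
    """Return a (label, copy count) pair for a tract.

    Labels are illustrative: 'homopolymer', 'dinucleotide', 'longer motif'.
    """
    if not seq:
        raise ValueError("empty sequence")
    unit = smallest_repeat_unit(seq)
    copies = len(seq) // len(unit)
    if len(unit) == 1:
        label = "homopolymer"
    elif len(unit) == 2:
        label = "dinucleotide"
    else:
        label = "longer motif"
    return label, copies
```

A flagging rule could then apply a stricter threshold to homopolymer and dinucleotide tracts than to longer motifs, with the actual thresholds tuned against real data.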

MattWellie commented 2 years ago

I think ultimately this will boil down to the question of which variants we want to be called by different callers, and how to resolve any issues at the interfaces.

e.g. GATK is able to call indels, but where do we draw the line between indel and STR such that, when we bring in additional variant sources (e.g. EH), we are able to select high-confidence indels from GATK, high-confidence STRs from EH, and potentially scrap the rest?

For this limited purpose, GATK records the flanking sequence context for small variants. Where the variant is a poly-base repeat within the context of a poly-base repeat, we have been asked by the clinical team to presumptively remove these from consideration. Further exposure to real data will help us get a better picture of how liberal we can be with this removal, but inclusion of STR callers for a different view of the same data would definitely be a way to change this issue from 'potentially removing real variants' to 'using multiple variant callers, each to their strengths'.
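As a rough illustration of the "poly-base repeat within a poly-base repeat" check, here is a minimal sketch in plain Python. It assumes left-anchored VCF-style indel alleles (e.g. `A` > `ATT`), takes the reference bases immediately downstream of the REF allele as a string, and uses a hypothetical `min_run` threshold of 5; the function names and the threshold are assumptions for discussion, not what the pipeline does.

```python
def indel_bases(ref: str, alt: str) -> str:
    """Return the inserted or deleted bases of a simple anchored indel.

    Returns '' for SNVs and for alleles where the shorter allele is not
    a prefix of the longer one (complex changes).
    """
    ref, alt = ref.upper(), alt.upper()
    if len(ref) == len(alt):
        return ""
    longer, shorter = (ref, alt) if len(ref) > len(alt) else (alt, ref)
    if not longer.startswith(shorter):
        return ""
    return longer[len(shorter):]


def leading_run(seq: str, base: str) -> int:
    """Count how many times base occurs consecutively at the start of seq."""
    count = 0
    for b in seq:
        if b != base:
            break
        count += 1
    return count


def is_polybase_in_polybase(ref: str, alt: str, downstream: str,
                            min_run: int = 5) -> bool:
    """True if a single-base-type indel sits in a reference run of that base.

    downstream: reference bases immediately after the REF allele (assumed input).
    min_run: hypothetical minimum reference tract length to count as poly-base.
    """
    changed = indel_bases(ref, alt)
    if not changed or len(set(changed)) != 1:
        return False  # not an indel, or the changed bases are not one repeated base
    base = changed[0]
    ref_run = leading_run(downstream.upper(), base)
    if len(ref) > len(alt):
        # For a deletion, the deleted bases are themselves part of the reference tract.
        ref_run += len(changed)
    return ref_run >= min_run
```

For example, `is_polybase_in_polybase("A", "AT", "TTTTTTGC")` is `True` (a T insertion into a run of six T), while the same insertion with downstream context `"GCGCGC"` is `False`. How aggressive `min_run` should be is exactly the open question above.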

MattWellie commented 1 year ago

https://hail.is/docs/0.2/hail.expr.LocusExpression.html#hail.expr.LocusExpression.sequence_context