iobio / cohort-gene

Statistics & global filtering #18

Open AlistairNWard opened 6 years ago

AlistairNWard commented 6 years ago
  1. How is enrichment being calculated?

  2. It would be good to see raw information on how many samples harbour each variant. If there are N probands and M samples in the subset, it would be good to see that x of N and y of M samples harbour the variant. This would show in the summary panel when you select the variant, and the y position showing enrichment would help the user decide which variants to look at more closely.

  3. Longer term, we should think how we can get additional information, e.g. the number of de novos or different zygosities.
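The "x of N, y of M" counts in point 2 could be computed along these lines. This is a minimal sketch, not how cohort-gene does it: the function names, the genotype-string input, and the assumption that the subset is drawn from the probands are all illustrative.

```python
import re


def has_alt(genotype):
    """True if a VCF-style genotype string (e.g. "0/1", "1|1") carries a non-reference allele."""
    alleles = re.split(r"[/|]", genotype)
    return any(allele not in ("0", ".") for allele in alleles)


def carrier_counts(genotypes, subset_ids):
    """Return (x, N, y, M) for one variant.

    genotypes  -- dict of sample ID -> genotype string for all probands (hypothetical input)
    subset_ids -- set of sample IDs in the subset, assumed to be a subset of the probands
    """
    n_probands = len(genotypes)
    x = sum(1 for gt in genotypes.values() if has_alt(gt))
    subset = [gt for sample, gt in genotypes.items() if sample in subset_ids]
    y = sum(1 for gt in subset if has_alt(gt))
    return x, n_probands, y, len(subset)


example = {"p1": "0/1", "p2": "0/0", "p3": "1/1", "p4": "./.", "p5": "0|1"}
x, n, y, m = carrier_counts(example, {"p1", "p2", "p5"})
print(f"{x} of {n} probands, {y} of {m} subset samples harbour the variant")
```

Missing genotypes (`./.`) are counted as non-carriers here; excluding them from N and M instead would be an equally defensible choice.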

stefinfection commented 6 years ago

Currently, variants are displayed by fold frequency along the y-axis. We'd like to incorporate a more statistically rigorous metric into this display, or perhaps into a filter.
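For reference, one plausible reading of "fold frequency" is the carrier frequency in the subset divided by the carrier frequency in all probands. The thread does not give the exact formula, so the sketch below is an assumption, and `fold_frequency` is a made-up name.

```python
def fold_frequency(subset_carriers, subset_size, proband_carriers, proband_size):
    """Ratio of subset carrier frequency to proband carrier frequency.

    Returns None when the proband frequency is zero, since the ratio is undefined.
    The formula itself is an assumption; the thread only says "fold frequency".
    """
    if subset_size <= 0 or proband_size <= 0:
        raise ValueError("group sizes must be positive")
    proband_freq = proband_carriers / proband_size
    if proband_freq == 0:
        return None
    return (subset_carriers / subset_size) / proband_freq


# 2 of 3 subset samples vs 3 of 5 probands carry the variant.
print(fold_frequency(2, 3, 3, 5))
```

A ratio like this has no notion of sample size, which is the motivation for a more rigorous metric.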

stefinfection commented 6 years ago

Thinking a simple Fisher's exact test would be appropriate. Would this be a filter, an alternate main view, a view within the popup, etc.?
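As a rough sketch of what that test would compute per variant, here is a two-sided Fisher's exact test on a 2x2 carrier table, written with the standard library only. How the table is built (subset vs. the remaining probands) is an assumption, not something decided in this thread.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = subset carriers,  b = subset non-carriers,
                    c = other carriers,   d = other non-carriers.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of seeing k carriers in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


print(fisher_exact_two_sided(3, 1, 1, 3))
```

The small relative tolerance on the comparison guards against floating-point ties between equally likely tables.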

stefinfection commented 6 years ago

Toggle with Manhattan plot style

AlistairNWard commented 6 years ago

The default should probably be the Manhattan plot based on this test. This is a view people are familiar with.
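In a Manhattan plot, each variant is drawn at its genomic position on the x-axis and at -log10(p) on the y-axis. A minimal sketch of that transform follows; the function name, the input shape, and the `floor` guard are illustrative assumptions.

```python
from math import log10


def manhattan_points(pvalues_by_position, floor=1e-300):
    """Map {position: p-value} to a position-sorted list of (position, -log10(p)).

    `floor` is a hypothetical guard so that p == 0 does not raise a math error.
    """
    return [
        (position, -log10(max(p_value, floor)))
        for position, p_value in sorted(pvalues_by_position.items())
    ]


for position, height in manhattan_points({200: 0.01, 100: 1.0}):
    print(position, height)
```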

stefinfection commented 6 years ago

Suggestions from H. Coon & N. Camp:

  1. Switch between using a Fisher's exact test for small counts and an Armitage trend test for larger counts, and filter variants according to a user-defined p-value (on load, we can use a relaxed threshold?)
  2. Make it very apparent to the user that the statistics are not super robust and that the goal is to give a starting point for further investigation
  3. Be wary in the future of comparing unaffected cohorts with affected cohorts - needs further thinking and discussion
  4. For global findings, incorporate functional pathways into the process to narrow down the genes we're looking across. Have the user provide pathway genes, or at least give them the ability to add to or subtract from a predefined list.
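To make suggestion 1 concrete, here is a sketch of the Cochran-Armitage trend test over genotype counts, plus one possible rule for choosing between it and Fisher's exact test. Everything specific is an assumption: the additive weights (0, 1, 2), the normal approximation, the "smallest expected count of at least 5" rule, and the function names. The thread does not define what counts as "small".

```python
from math import erfc, sqrt

# Additive coding for (hom-ref, het, hom-alt); an assumption, not from the thread.
WEIGHTS = (0, 1, 2)


def armitage_trend(subset_counts, other_counts, weights=WEIGHTS):
    """Return (z, two-sided p) using the normal approximation."""
    r_tot, s_tot = sum(subset_counts), sum(other_counts)
    n_tot = r_tot + s_tot
    if n_tot == 0:
        return 0.0, 1.0
    col = [r + s for r, s in zip(subset_counts, other_counts)]
    k = len(weights)
    # Trend statistic: weighted difference between the two groups.
    t = sum(w * (r * s_tot - s * r_tot)
            for w, r, s in zip(weights, subset_counts, other_counts))
    var = sum(weights[i] ** 2 * col[i] * (n_tot - col[i]) for i in range(k))
    var -= 2 * sum(weights[i] * weights[j] * col[i] * col[j]
                   for i in range(k) for j in range(i + 1, k))
    var *= r_tot * s_tot / n_tot
    if var <= 0:
        return 0.0, 1.0
    z = t / sqrt(var)
    return z, erfc(abs(z) / sqrt(2))


def pick_test(subset_counts, other_counts, min_expected=5):
    """Return "trend" if every non-empty genotype column has an expected
    count of at least min_expected in both groups, else "fisher"."""
    r_tot, s_tot = sum(subset_counts), sum(other_counts)
    n_tot = r_tot + s_tot
    col = [r + s for r, s in zip(subset_counts, other_counts)]
    expected = [c * g / n_tot for c in col if c > 0 for g in (r_tot, s_tot)]
    if not expected or min(expected) < min_expected:
        return "fisher"
    return "trend"


z, p = armitage_trend((10, 20, 30), (30, 20, 10))
print(pick_test((10, 20, 30), (30, 20, 10)), z, p)
```

When `pick_test` returns "fisher", the genotype counts would be collapsed to a 2x2 carrier table first. The user-defined threshold then reduces to keeping variants with `p <= threshold`.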

stefinfection commented 6 years ago

Suggestions from G. Marth & A. Farrell at lab meeting 08/28:

  1. Filter the VCF to calls marked PASS in the FILTER column and create a new VCF to work off of
  2. Filter the VCF using the QUAL column (what should a passing value be?) & vt filter
  3. If needed, look at combining allele frequencies and also filtering on those
  4. Compare variants coming back after these steps in phase 1 to those called by consortium (this will require finishing # to allow user uploaded files)
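Steps 1 and 2 above could be prototyped with a plain line filter like the one below. This is a sketch under assumptions: it relies only on the standard VCF column order (QUAL is column 6, FILTER is column 7), the function name is made up, and `min_qual` is left unset by default because the passing value is still an open question here. It does not attempt to reproduce vt filter.

```python
def filter_vcf_lines(lines, min_qual=None):
    """Yield header lines plus records whose FILTER is PASS.

    If min_qual is given, records with a missing QUAL or QUAL below it are dropped too.
    """
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip malformed records
        qual, filter_status = fields[5], fields[6]
        if filter_status != "PASS":
            continue
        if min_qual is not None and (qual == "." or float(qual) < min_qual):
            continue
        yield line


records = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "1\t200\t.\tC\tT\t10\tPASS\t.\n",
    "1\t300\t.\tG\tA\t99\tLowQual\t.\n",
]
# 30 is a placeholder threshold for illustration only.
for kept in filter_vcf_lines(records, min_qual=30):
    print(kept, end="")
```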

AlistairNWard commented 6 years ago

Quality might not be that useful, but other values would be good.

Allele count / allele frequency / allele balance: we want to have a minimum coverage for the locus (say 20x, but we can play around with these values). For a het, we would expect to see ~50% alt allele, so maybe demand at least 25% alternate allele frequency.

We also expect there to be observations from forward and reverse strand, so make sure we see representation of the alt allele from both strands.
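Put together, the three checks above (minimum coverage, minimum alt allele fraction, alt allele seen on both strands) might look like this per genotype call. The 20x and 25% defaults come from this comment. The function name, the read-count inputs, and computing depth as ref plus alt reads are assumptions.

```python
def passes_site_filters(ref_reads, alt_fwd, alt_rev,
                        min_depth=20, min_alt_fraction=0.25):
    """Apply coverage, allele-balance and strand checks to one call.

    ref_reads -- reads supporting the reference allele
    alt_fwd   -- alt-supporting reads on the forward strand
    alt_rev   -- alt-supporting reads on the reverse strand
    Depth is taken as ref + alt reads, which ignores any other alleles (an assumption).
    """
    alt_reads = alt_fwd + alt_rev
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return False  # not enough coverage at the locus
    if alt_reads / depth < min_alt_fraction:
        return False  # alt allele too rare for a believable het
    return alt_fwd > 0 and alt_rev > 0  # alt must appear on both strands


print(passes_site_filters(ref_reads=12, alt_fwd=5, alt_rev=6))
print(passes_site_filters(ref_reads=10, alt_fwd=10, alt_rev=0))
```

Both thresholds are exposed as parameters so they can be tuned, as suggested above.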