populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Group Outputs by Families #45

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

Currently AIP generates per-sample results, where each affected individual across the whole cohort can be assessed independently (MOI tests are conducted per-family, but the operation is individual-centric). The output format is a dictionary, keyed by the sample IDs at the top level.

{
    'sample_1': {
        'variant_1_key': _DETAILS_,
        ...
    },
    ...
}

This has not been a problem so far, as each family in the development dataset is a trio with a single affected participant. If we have multiple affected persons within a single family, we would expect the results to be repeated for each participant (under a complete penetrance model, if the MOI fits for one affected participant, it must fit for all affected participants).

In the JSON and HTML results each sample has a separate variant table, and the cohort-level stats count the number of variants per individual. This is potentially misleading, as we would be double-counting for every family with multiple affected persons.
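To make the double-counting concrete, here is a minimal sketch comparing the two ways of counting. It assumes the per-sample dictionary shape shown above and a hypothetical `sample_to_family` lookup derived from the pedigree; the function name and the example data are illustrative only, not part of AIP.

```python
def count_variants(per_sample, sample_to_family):
    """Return (per-individual count, per-family unique count)."""
    # Current behaviour: every sample's variants are counted separately.
    per_individual = sum(len(variants) for variants in per_sample.values())

    # Family-aware count: a variant is counted once per family,
    # no matter how many affected members it was reported against.
    unique_per_family = {
        (sample_to_family[sample_id], variant_key)
        for sample_id, variants in per_sample.items()
        for variant_key in variants
    }
    return per_individual, len(unique_per_family)


# Hypothetical data: fam_1 has two affected members sharing variant_1_key.
per_sample = {
    'sample_1': {'variant_1_key': 'details_a'},
    'sample_2': {'variant_1_key': 'details_a', 'variant_2_key': 'details_b'},
    'sample_3': {'variant_3_key': 'details_c'},
}
sample_to_family = {
    'sample_1': 'fam_1',
    'sample_2': 'fam_1',
    'sample_3': 'fam_2',
}

individual_total, family_total = count_variants(per_sample, sample_to_family)
```

In this example the shared variant in `fam_1` is the only source of the difference between the two totals.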

The Seqr links we generate are per-family, so the same variant appearing against multiple family members just duplicates the exact same link.

If we move to analysing partial penetrance, is it possible that variants will pass MOI tests for some affected participants and not others? I don't think so... partial penetrance concerns the opposite scenario (presence of the variant in non-affected persons). So long as we always treat affection status within a single family as relating to the same disease, we should not have any issues with this.

Proposal:

Using the Pedigree, aggregate the results for all members of a family when 'simplifying' the results (removing redundant entries). Instead of presenting per-sample, we should present results per-family.
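A rough sketch of the aggregation, assuming the per-sample dictionary shape shown earlier and a hypothetical `sample_to_family` mapping built from the pedigree. The function name, the `details`/`samples` keys and the example data are all made up for illustration; the real structure would need to fit the comparison interface.

```python
from collections import defaultdict


def group_by_family(per_sample, sample_to_family):
    """Collapse {sample: {variant: details}} into {family: {variant: {...}}}."""
    per_family = defaultdict(dict)
    for sample_id, variants in per_sample.items():
        family_id = sample_to_family[sample_id]
        for variant_key, details in variants.items():
            # Keep one entry per variant per family, and record which
            # affected members it was reported against.
            entry = per_family[family_id].setdefault(
                variant_key, {'details': details, 'samples': []}
            )
            entry['samples'].append(sample_id)
    return dict(per_family)


# Hypothetical data: two affected members in fam_1, one in fam_2.
per_sample = {
    'sample_1': {'variant_1_key': 'details_a'},
    'sample_2': {'variant_1_key': 'details_a', 'variant_2_key': 'details_b'},
    'sample_3': {'variant_3_key': 'details_c'},
}
sample_to_family = {
    'sample_1': 'fam_1',
    'sample_2': 'fam_1',
    'sample_3': 'fam_2',
}

per_family = group_by_family(per_sample, sample_to_family)
```

Keeping the list of contributing samples on each entry means the per-sample view can still be reconstructed if the comparison process needs it. Note that this sketch keeps the details from the first sample seen and assumes they are identical across family members.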

This requires a bit more thought, so as not to break the interface with the comparison process.

MattWellie commented 1 year ago

Kicking this back until we decide on whether to use Jinja templating etc.

MattWellie commented 1 year ago

Re-opening! We can approximate this by grouping on family ID rather than individual ID. The family ID is annotated onto each individual variant JSON blob, so it should be accessible to the templating.
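As a sketch of the group-by, assuming each variant blob is a flat dict carrying a `family` field (the field names `sample`, `family` and `variant` here are placeholders, not the actual AIP keys), the rows could be grouped before rendering like this:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flattened variant blobs, one per (sample, variant) report.
variant_blobs = [
    {'sample': 'sample_1', 'family': 'fam_1', 'variant': 'variant_1_key'},
    {'sample': 'sample_3', 'family': 'fam_2', 'variant': 'variant_3_key'},
    {'sample': 'sample_2', 'family': 'fam_1', 'variant': 'variant_1_key'},
]


def rows_by_family(blobs):
    """Group blobs by their annotated family ID."""
    # groupby only merges adjacent items, so sort on the same key first.
    # sorted() is stable, so rows keep their original order within a family.
    ordered = sorted(blobs, key=itemgetter('family'))
    return {
        family_id: list(rows)
        for family_id, rows in groupby(ordered, key=itemgetter('family'))
    }


grouped = rows_by_family(variant_blobs)
```

This only changes how rows are presented; the same variant reported against two family members still produces two rows in the family's group, so it approximates rather than replaces a true per-family aggregation.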