populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License

Multiple Panels #46

Closed MattWellie closed 1 year ago

MattWellie commented 1 year ago

Issue

We would like to enable taking results from multiple panels when running the analysis.

Example problem: I am analysing a neuromuscular (NM) cohort, and some NM genes are present on the incidentalome, so the current analysis would miss them. To solve this, I would like to combine multiple panels during analysis instead of taking a single panel.

Proposal?

No proposal yet. This will require input from the clinical team, as there are multiple options available.

Common concept: We allow for multiple panels to be provided by argument, e.g. instead of '137', we could use '137,420,69'. Gene & MOI data from all the panels will be gathered from PanelApp. But what do we do with it?!

  1. Set Union - we query for all gene panels, and accept any gene as 'green' if it appears on any of the individual panels.

    • This model has complexities over what we do when the same gene is on different panels, and the MOI is not identical (either through lack of complete curation, or accepted MOI being different for specific diseases)
  2. Separate but Equal - we query for each individual panel, and store in parallel. Example would be taking the gene panel name or ID number and retaining as a top level key.

    • When running the analysis, we would iterate over each panel for each variant, and if the variant gene is on the panel, we would run with the panel-determined MOI
    • If we have the same gene on two panels, we process the variant twice
    • Confirmation events could include the panel ID, so the final report could differentiate between possible different MOIs on different panels
    • If we process a panel which doesn't contain the current variant's gene, simply continue.
{
    '137/mendeliome': {
         'ENSG1': {
              'symbol': 'SYMBOL'
         },
         ...
    },
    '420/blaze_it': {
         'ENSG2': {
              'symbol': 'SYMBOL2'
         },
         ...
    }
}
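
A minimal sketch of how the 'Separate but Equal' option could iterate over the panel-keyed structure above. The 'moi' field, its values, and the function name `panel_mois_for_gene` are assumptions for illustration; the structure above only shows 'symbol'.

```python
from typing import Dict, Iterator, Tuple

# Hypothetical panel data, keyed by panel as in the structure above.
# The 'moi' field is an assumption; the original example only shows 'symbol'.
PANELS = {
    '137/mendeliome': {
        'ENSG1': {'symbol': 'SYMBOL', 'moi': 'Monoallelic'},
        'ENSG2': {'symbol': 'SYMBOL2', 'moi': 'Biallelic'},
    },
    '420/blaze_it': {
        'ENSG2': {'symbol': 'SYMBOL2', 'moi': 'Monoallelic'},
    },
}


def panel_mois_for_gene(
    gene_id: str, panels: Dict[str, Dict[str, dict]]
) -> Iterator[Tuple[str, str]]:
    """Yield (panel key, MOI) for every panel that contains the gene."""
    for panel_key, genes in panels.items():
        if gene_id not in genes:
            # This panel doesn't contain the variant's gene: simply continue.
            continue
        yield panel_key, genes[gene_id]['moi']


# A gene on two panels is processed twice, once per panel-determined MOI.
for panel_key, moi in panel_mois_for_gene('ENSG2', PANELS):
    print(f'ENSG2 evaluated under {moi} MOI from panel {panel_key}')
```

Yielding the panel key alongside the MOI is what would let a confirmation event carry the panel ID through to the final report.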

Issues

The current PanelApp query includes two different ways of determining if a gene is 'new' - using a prior-knowledge gene list, or using a prior panel version.

python query_panelapp.py --panels "137:1.0892,420:0.1234,69" ...

To support multiple panels alongside prior versions/panels, we should revert the exact historic version logic back to the date logic.
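
For reference, the `--panels` value shown in the command above could be split into panel IDs and optional prior versions along these lines. This is a sketch only: `parse_panels` is a hypothetical helper, not the existing code in `query_panelapp.py`.

```python
from typing import Dict, Optional


def parse_panels(arg: str) -> Dict[str, Optional[str]]:
    """Map each panel ID to its prior version, or None when no version is given."""
    panels: Dict[str, Optional[str]] = {}
    for chunk in arg.split(','):
        chunk = chunk.strip()
        if not chunk:
            # Tolerate stray commas such as '137,,420'.
            continue
        # partition() returns an empty version when there is no ':' in the chunk.
        panel_id, _, version = chunk.partition(':')
        panels[panel_id] = version or None
    return panels


print(parse_panels('137:1.0892,420:0.1234,69'))
# {'137': '1.0892', '420': '0.1234', '69': None}
```

Versions are kept as strings here, since converting something like '1.0892' to a float would make exact version comparison unreliable.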

MattWellie commented 1 year ago

My preference is to go with keeping each panel separate, and allowing for either a gene list (as we're currently using) or a prior date, that way we don't force users to provide multiple panels as well as per-panel prior versions.

The PanelApp JSON written out with the analysis does include the 'latest' version used, as well as the date, so we could use either approach. I think the date will make for a cleaner CLI whilst accomplishing pretty much identical results.

MattWellie commented 1 year ago

Proposal following today's meeting:

  1. Always keep the Mendeliome as a base
  2. Allow for one or more additional panels to be added by PanelApp ID
  3. For each additional panel, add genes to the Mendeliome query results:
    • if the gene already exists in the Mendeliome, add a tag to the Mendeliome entry with this panel name
    • if the gene doesn't exist on the Mendeliome, add an entry with the MOI on this panel and a panel name tag
    • if the Mendeliome entry doesn't have an MOI but the specific panel does?? (unlikely)

After this has all been collected, annotate 'new' using a gene list if provided
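
A rough sketch of the merge described above, assuming each gene entry is a dict with 'symbol' and 'moi' keys. The names `merge_panels`, `annotate_new`, and the 'flags' and 'new' keys are hypothetical. The open case where the Mendeliome entry lacks an MOI is deliberately left unresolved here: the Mendeliome entry is kept as-is.

```python
from typing import Dict, Iterable, Optional


def merge_panels(
    base: Dict[str, dict], additional: Dict[str, Dict[str, dict]]
) -> Dict[str, dict]:
    """Fold additional panels into a copy of the base (Mendeliome) gene dict."""
    # Copy each base entry so the Mendeliome query results are not mutated.
    merged = {gene: {**entry, 'flags': []} for gene, entry in base.items()}
    for panel_name, genes in additional.items():
        for gene, entry in genes.items():
            if gene in merged:
                # Gene already present: keep the existing MOI, tag this panel.
                merged[gene]['flags'].append(panel_name)
            else:
                # Gene not on the Mendeliome: take this panel's MOI and tag it.
                merged[gene] = {**entry, 'flags': [panel_name]}
    return merged


def annotate_new(
    merged: Dict[str, dict], known_genes: Optional[Iterable[str]] = None
) -> Dict[str, dict]:
    """Mark genes absent from the prior gene list as 'new', if a list is given."""
    known = set(known_genes) if known_genes is not None else None
    for gene, entry in merged.items():
        entry['new'] = known is not None and gene not in known
    return merged


mendeliome = {'ENSG1': {'symbol': 'SYMBOL', 'moi': 'Monoallelic'}}
extra = {
    '420': {
        'ENSG1': {'symbol': 'SYMBOL', 'moi': 'Biallelic'},
        'ENSG2': {'symbol': 'SYMBOL2', 'moi': 'Biallelic'},
    }
}
result = annotate_new(merge_panels(mendeliome, extra), known_genes={'ENSG1'})
print(result['ENSG1']['flags'], result['ENSG2']['new'])  # ['420'] True
```

Note that a gene found on two additional panels but not on the Mendeliome keeps the MOI of whichever panel is processed first; that ordering rule is an assumption of this sketch, not something agreed in the meeting.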

End result will be a single dictionary (not stratified by different panels), with additional information to show if the gene featured on an additional panel. Shouldn't need much alteration of downstream logic, only to add the flag to the reported variants where appropriate. When the output is built as HTML, the flags should be presented in the final table to show if the gene was in a specific panel/panels (gold stars added as appropriate).

For discussion:

We can have a gene blacklist to withhold specific genes from the report.

If the incidentalome is one of the additional panels we could parse for specific tags and choose not to add the gene(s) to the gene list, but that means adding specific logic for one named panel. Rather than having to do that up front, treating all panels the same & using a gene blacklist for the reporting will make the logic more generic all over (there may be non-incidentalome reasons to block genes from the report, and this will cater to that use-case)

If it's crucial that only 'good' incidentalome genes are brought in at all, this logic can be tailored to anything, I'm easy
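
As an illustration of the generic blacklist idea, reporting could drop blocked genes in one place regardless of which panel they came from. This is a sketch under assumptions: `filter_reportable` and the 'gene' key on each variant record are hypothetical, not the current report code.

```python
from typing import Dict, Iterable, List


def filter_reportable(variants: List[Dict], blocked_genes: Iterable[str]) -> List[Dict]:
    """Return only the variants whose gene is not on the blacklist."""
    blocked = set(blocked_genes)
    return [variant for variant in variants if variant['gene'] not in blocked]


variants = [
    {'gene': 'ENSG1', 'flags': []},
    {'gene': 'ENSG2', 'flags': ['420']},
]
print(filter_reportable(variants, blocked_genes=['ENSG2']))
# [{'gene': 'ENSG1', 'flags': []}]
```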