The only phenotype matching currently offered in Talos uses the floppy panel~participant mapping, which relies on the terms (and the granularity of those terms) added to panels and participants in order to work.
This replaces and expands on #390
During this change I also ran Bandit for code inspection, and its only complaint was that the code was peppered with asserts (see here). Python strips assert statements when code is run in optimised mode, so I've removed a few assert statements to be more secure.
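For context on the Bandit finding: `python -O` drops every `assert`, so a check written as an assert silently stops running. Below is a minimal sketch of how such a check can be kept as an explicit condition instead. `load_hpo_terms` is a hypothetical helper for illustration, not a function from this PR.

```python
def load_hpo_terms(raw_terms: list[str]) -> set[str]:
    """Normalise a list of HPO IDs (hypothetical helper, illustration only)."""
    # An `assert raw_terms, '...'` here would be stripped under `python -O`.
    # An explicit check still runs regardless of optimisation flags.
    if not raw_terms:
        raise ValueError('no HPO terms provided')
    return {term.strip().upper() for term in raw_terms}
```

Raising a specific exception also gives callers something they can catch, which an `AssertionError` does not encourage.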
Proposed Changes
This takes in 2 files: a phenotype DB file containing an HPO ontology, and an NCBI dump file containing all gene symbols and the phenotypes they've been associated with
We parse this NCBI file, collecting all the phenotype terms associated with each Gene (symbol)
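As a rough illustration of this parsing step, the sketch below collects a set of HPO IDs per gene symbol. It assumes a tab-delimited file with a header row, and the column names `gene_symbol` and `hpo_id` are assumptions; the real file layout may differ.

```python
import csv
import io
from collections import defaultdict


def parse_gene_phenotypes(handle) -> dict[str, set[str]]:
    """Collect the set of HPO IDs seen for each gene symbol."""
    gene_to_hpos: dict[str, set[str]] = defaultdict(set)
    for row in csv.DictReader(handle, delimiter='\t'):
        # 'gene_symbol' and 'hpo_id' are assumed column names, not confirmed by this PR
        gene_to_hpos[row['gene_symbol']].add(row['hpo_id'])
    return dict(gene_to_hpos)


# Placeholder rows standing in for the real dump file
example = io.StringIO(
    'gene_symbol\thpo_id\n'
    'GENE1\tHP:0000001\n'
    'GENE1\tHP:0000002\n'
    'GENE2\tHP:0000001\n'
)
gene_hpos = parse_gene_phenotypes(example)
```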
Adds a new Stage HPOFlagging - this uses the database of Gene-Phenotype and gathers data for all the genes in this analysis. For each of the variants in the report JSON, it checks if the gene symbol (and its associated HPO terms) is a phenotypic match to the HPO terms in the family. This is either done strictly using a set intersection, or by using Semsimian to conduct a pairwise comparison between the sets of HPO terms. All terms (ID: Label) which pass are recorded as a label for the variant.
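The strict mode described above can be sketched as a plain set intersection. This is a simplified illustration rather than the code in `HPOFlagging.py`: the function name and the data shapes are assumptions, and the Semsimian pairwise mode is not shown.

```python
def strict_phenotype_match(gene_hpos: set[str], family_hpos: dict[str, str]) -> list[str]:
    """Return 'ID: Label' strings for family HPO terms also associated with the gene.

    gene_hpos:   HPO IDs linked to the gene symbol
    family_hpos: HPO ID -> label for terms recorded in the family (assumed shape)
    """
    shared = gene_hpos & set(family_hpos)
    return sorted(f'{hpo_id}: {family_hpos[hpo_id]}' for hpo_id in shared)


matches = strict_phenotype_match(
    {'HP:0001250', 'HP:0000002'},
    {'HP:0001250': 'Seizure', 'HP:0001263': 'Global developmental delay'},
)
```

Strict matching only fires on identical term IDs, so a family annotated with a more specific or more general term than the gene would not match in this mode.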
HPOFlagging takes in one report JSON, and exports 2 JSON objects - one with Phenotype annotations, and one filtered to retain only those with phenotype annotations. The input and output are in the same format, so it sits as an optional step in the workflow.
The Gene~Phenotype DB works on Symbol, but our Gene IDs are mainly Ensembl ENSG IDs. This is handled through the FindGeneSymbolMap step - this runs API queries on the Ensembl REST API to get a symbol for each ENSG in our PanelApp data. This is used to match between the data files and report content during phenotype matching.
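A hedged sketch of the lookup is below, using the Ensembl REST `lookup/id` endpoint, which returns the symbol in the `display_name` field. Whether `FindGeneSymbolMap.py` queries one ID at a time or in batches is not stated here, so the per-ID loop is an assumption, as are the function names.

```python
import json
from typing import Callable
from urllib.request import Request, urlopen

ENSEMBL_LOOKUP = 'https://rest.ensembl.org/lookup/id/{ensg}'


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body."""
    request = Request(url, headers={'Content-Type': 'application/json'})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def find_symbols(ensgs: list[str], fetch: Callable[[str], dict] = fetch_json) -> dict[str, str]:
    """Map each ENSG ID to its gene symbol, skipping IDs with no symbol."""
    symbols: dict[str, str] = {}
    for ensg in ensgs:
        record = fetch(ENSEMBL_LOOKUP.format(ensg=ensg))
        # 'display_name' carries the gene symbol in the lookup response
        symbol = record.get('display_name')
        if symbol:
            symbols[ensg] = symbol
    return symbols
```

Passing the fetcher in as an argument keeps the mapping logic testable without network access.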
src/talos/MatchGenesToPhenotypes.py still exists here, but the HPO-based matching of participants to Gene HPO terms is too generous and too long-running to be done in advance with Semsimian. In a couple of tests, roughly half the gene search space was 'phenotype matched' to a number of families, and the comparisons took ages. The template is being kept, but is not in use.
Meaningful changes (not just linting)
requirements.txt - adds semsimian
setup.py - adds 2 new script entrypoints
src/talos/CreateTalosHTML.py - adapts to phenotype date and HPO matches, displays with a tooltip
src/talos/FindGeneSymbolMap.py - new script, finds matching symbols for all ENSG IDs in the PanelApp results
src/talos/HPOFlagging.py - matches participant HPOs to gene HPOs, labels results with a list of relevant phenotypes. The core of this comparison is identical to #390
src/talos/MatchGenesToPhenotypes.py - not working yet. This was intended to compare all participant HPO groups to all gene HPO groups, identifying a set of genes for each participant which should be treated more permissively
src/talos/example_config.toml - adds a couple of config entries relevant to new functionality
src/talos/models.py - add new values in the Report and History data models
src/talos/templates/variant_table.html.jinja - add a phenotype column, and its contents
src/talos/utils.py - new method to update history file with phenotype-match content
test/test_hpo_flagging.py - one small test, needs more ideally
History
If a history file can be located, adds the date_of_phenotype_match to the variant if it was previously a phenotype match, or otherwise updates the history file with latest_phenotype_match == today and re-saves it.
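One reading of that rule, sketched with plain dicts. The field names come from the description above, but the function name and the dict-based data shapes are assumptions, and the real logic in `utils.py` works against the History data model.

```python
from datetime import date
from typing import Optional


def update_phenotype_history(variant: dict, history: dict, today: Optional[str] = None) -> None:
    """Carry forward an earlier match date, or record today's date in the history."""
    today = today or date.today().isoformat()
    previous = history.get('latest_phenotype_match')
    if previous:
        # Previously a phenotype match: surface the stored date on the variant
        variant['date_of_phenotype_match'] = previous
    else:
        # First time seen as a match: record it so later runs can report it
        history['latest_phenotype_match'] = today
```

The caller would then re-save the history file after the update.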
Report
Adds a new "Phenotype Match" column to the report - this is either "No Match", or "Phenotype Match!" followed by an icon. The icon has a tooltip featuring the latest date of a phenotype match, and the list of HPO IDs & labels which were matched between the gene and family.
Checklist
[x] Related Issue created
[x] Tests covering new change
[x] Linting checks pass
Remaining TODO
Probably some more tests, let's be real
Settle on a process to source the reference files periodically, and a place to find them (cpg-common?)