opentargets / issues

Issue tracker for Open Targets Platform and Open Targets Genetics Portal
https://platform.opentargets.org https://genetics.opentargets.org
Apache License 2.0

Report PanelApp data findings to Genomics England #1665

Closed by tskir 2 years ago

tskir commented 3 years ago

Follow-up issue from https://github.com/opentargets/platform/issues/1636.

Final set of suggestions will need to be compiled after EFO mapping is implemented as well.

tskir commented 3 years ago

List of some data normalisation issues (updated 2022-09-15)

Different ways of specifying the publications in the API

The list of everything I could find:

98(6):1193-207. doi: 10.1016/j.ajhg.2016.05.004. PubMed PMID: 27259053, PubMed Central PMCID: PMC4908191.
doi:10.​1007/​s12265-016-9673-5
DOI: https://doi.org/10.1016/j.xhgg.2021.100033
https://doi.org/10.1101/797787
https://doi-org.ezproxy.library.qmul.ac.uk/10.1093/brain/awaa085
PMID: 26933893
PMID: 27078007 (full text not available to confirm findings).
Aldahmesh (2012) Genet Med 14(12):955-962, PMID: 22935719
12702164
15985586 (two siblings)
16060907 (Camilot et al., 2005 report subclinical hypothyroid subjects with heterozygous substitutions
25674101 - review from the same authors as PMID:23972370

2194867118000911 — but displayed as two separate ones: https://panelapp.genomicsengland.co.uk/panels/81/gene/FAM20C/#!details
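To illustrate what normalising these variants could involve, here is a minimal sketch that pulls PMIDs and DOIs out of free-text publication fields. The patterns and the `extract_ids` helper are hypothetical and are not the regular expressions used by the Open Targets parser; restricting PMIDs to 7 or 8 digits is an assumption made to avoid matching years and page numbers.

```python
import re

# Hypothetical patterns for illustration only.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s,;]+")
# Assumption: a PMID is a standalone run of 7 or 8 digits.
PMID_RE = re.compile(r"(?<!\d)\d{7,8}(?!\d)")
PMCID_RE = re.compile(r"PMC\d+")
# Zero-width characters sometimes embedded in copied DOIs.
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\ufeff]")


def extract_ids(raw):
    """Return the PMIDs and DOIs found in a free-text publication field."""
    text = ZERO_WIDTH_RE.sub("", raw)
    dois = [doi.rstrip(".") for doi in DOI_RE.findall(text)]
    # Mask DOIs and PMCIDs so their digits are not mistaken for PMIDs.
    masked = PMCID_RE.sub(" ", DOI_RE.sub(" ", text))
    pmids = PMID_RE.findall(masked)
    return {"pmids": pmids, "dois": dois}


print(extract_ids("25674101 - review from the same authors as PMID:23972370"))
print(extract_ids("https://doi-org.ezproxy.library.qmul.ac.uk/10.1093/brain/awaa085"))
```

Under these assumptions a concatenated value such as 2194867118000911 yields no PMIDs at all, because the sketch does not try to guess where one identifier ends and the next begins.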

tskir commented 2 years ago

Investigation of 2022 data update (2020-09-28 → 2022-08-04)

A spreadsheet with the data and the metrics is available here: https://docs.google.com/spreadsheets/d/1VBykrN6iyEqBuGJgOJYNOYefINe3dmJ2NjEKCuXmHcE/edit#gid=654888619. The following report contains some excerpts and analysis.

The files being compared are:

  1. Old: All_genes_20200928-1959.tsv (2020-09-28, MD5 024bbad3685a0a9797e63314e6e7c77a)
  2. New: All_genes_20220804-1350_green_amber_red_public_v1plus_no_superpanels.xlsx (2022-08-04, MD5 6a52bece16f49891f8b9aa7135d0e476).
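The MD5 checksums quoted for the two files can be checked against local copies with something like the following sketch. The `md5_of_file` helper is a hypothetical name, and the file path in the usage comment assumes the file sits in the current directory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming the old file is in the current directory:
# md5_of_file("All_genes_20200928-1959.tsv") == "024bbad3685a0a9797e63314e6e7c77a"
```

Reading in chunks keeps memory use flat regardless of file size, which is a convenience rather than a requirement for files of this size.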

Compared to the old file, the new file has significantly fewer:

The complete list of the 29 panels which are missing from the new file:

| Panel ID | Panel name | Number of genes |
| --- | --- | --- |
| 8 | Refuted genes | 5 |
| 14 | Multiple bowel polyps | 14 |
| 28 | Congenital neutropaenia | 17 |
| 32 | Kyphoscoliotic Ehlers-Danlos syndrome | 3 |
| 58 | Ehlers-Danlos syndrome type 3 | 55 |
| 64 | ClinGen Gene Validity Curations | 47 |
| 67 | Epileptic encephalopathy | 183 |
| 121 | A- or hypo-gammaglobulinaemia | 28 |
| 124 | Combined B and T cell defect | 24 |
| 135 | Dilated Cardiomyopathy (DCM) | 75 |
| 137 | Familial colon cancer | 26 |
| 160 | Genetic Epilepsies with Febrile Seizures Plus (GEFS+) | 6 |
| 161 | Epilepsy Plus | 142 |
| 170 | SCID | 25 |
| 203 | Agranulocytosis | 2 |
| 204 | Bilateral microtia | 46 |
| 210 | ClinGen_Familial thoracic aortic aneurysm and aortic dissection | 53 |
| 240 | Familial Genetic Generalised Epilepsies | 25 |
| 252 | Familial Focal Epilepsies | 10 |
| 268 | Meiges disease | 14 |
| 289 | Multiple Tumours | 129 |
| 297 | Bardet-Biedl Syndrome | 22 |
| 399 | Additional findings health related | 14 |
| 412 | Gene therapy clinical trials | 21 |
| 657 | Autism | 735 |
| 720 | Groopman et al 2019 - Genes with diagnostic variants | 66 |
| 745 | CHARGE syndrome | 1 |
| 880 | Nephrolithiasis and Nephrocalcinosis_KidGen_VCGS | 30 |
| 928 | Viral resistance | 24 |
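A list of missing panels like the one above could be derived by comparing panel IDs between the two exports. The sketch below is one way to do that, not the method actually used for this report. The column names `Panel ID` and `Panel name`, the one-row-per-gene layout, and the toy gene symbols are all assumptions.

```python
import csv
import io


def panel_gene_counts(tsv_text, id_col="Panel ID", name_col="Panel name"):
    """Count rows (assumed one per gene) for each (panel ID, panel name)."""
    counts = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (int(row[id_col]), row[name_col])
        counts[key] = counts.get(key, 0) + 1
    return counts


def missing_panels(old_text, new_text):
    """Return (ID, name, gene count) for panels in old but absent from new."""
    old = panel_gene_counts(old_text)
    new_ids = {panel_id for panel_id, _ in panel_gene_counts(new_text)}
    return sorted(
        (panel_id, name, n)
        for (panel_id, name), n in old.items()
        if panel_id not in new_ids
    )


# Toy data with hypothetical column names and placeholder gene symbols.
old_tsv = (
    "Panel ID\tPanel name\tGene\n"
    "745\tCHARGE syndrome\tGENE_A\n"
    "203\tAgranulocytosis\tGENE_B\n"
    "203\tAgranulocytosis\tGENE_C\n"
    "1\tKept panel\tGENE_D\n"
)
new_tsv = "Panel ID\tPanel name\tGene\n1\tKept panel\tGENE_D\n"
print(missing_panels(old_tsv, new_tsv))
```

Matching on panel ID rather than panel name is a deliberate choice in this sketch, so that a renamed panel is not reported as missing.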

The metrics which have not significantly changed between the versions:

Finally, our parser runs successfully on the new file and generates evidence which validates against the schema. However, since several important-looking panels are missing, I do not recommend migrating to it right away.

tskir commented 2 years ago

Followed up with Eleanor regarding data normalisation

Re-ran the parser on the new data (All_genes_20220804-1350_green_amber_red_public_v1plus_no_superpanels.tsv) with debug tables and made sure that the preprocessing regular expressions still hold up. As far as I can see, everything still gets processed correctly. The debug tables can be found here, sheets "Phenotypes" and "PMIDs".

There are no more actions on our side. Closing this issue.