Map phenotypes from collapsing analyses to EFO

DSuveges commented 2 years ago

TBC

ireneisdoomed commented 2 years ago

GWAS Catalog is working on integrating both REGENERON summary statistics and AstraZeneca PheWAS Portal data. This way we can delegate the phenotype mapping task to them.

They already have the REGENERON associations here, so we have been able to generate and validate evidence. In the regeneron_gwascat_evidence_mapping sheet of the spreadsheet below, I have analysed the quality of the mappings present in the final evidence.

https://docs.google.com/spreadsheets/d/16C6ScZBB8FjYjX_K5eGpPO6OcecwSShl5b2EMJWvdG4/edit?usp=sharing

Those mappings that I would change are marked with a FALSE in the review column. I have also classified their level of error by colour:

white, minor errors that could be better defined
yellow, errors of some relevance
red, clear error

Of the 470 mappings, 31 have been marked FALSE: 16 white (blank), 9 yellow and 6 red. My overall impression is very good.
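To put those counts in proportion, here is a quick tally. It assumes the 16 blank rows are the "white" category from the legend; the variable names are only illustrative.

```python
# Counts taken from the comment above; the keys follow the colour legend.
# Assumption: the 16 "blank" rows correspond to the "white" (minor) category.
total_mappings = 470
flagged = {"white": 16, "yellow": 9, "red": 6}

flagged_total = sum(flagged.values())
flagged_share = flagged_total / total_mappings
clear_error_share = flagged["red"] / total_mappings

print(f"flagged for review: {flagged_total} ({flagged_share:.1%})")
print(f"clear errors: {flagged['red']} ({clear_error_share:.1%})")
```

So roughly 6.6% of the mappings were flagged at all, and about 1.3% are clear errors.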

Note that a few recurrent terms are not present in the EFO OTAR Slim, so evidence mapped to them will fail. We should check whether they are of any value.

ireneisdoomed commented 2 years ago

As far as the AZ mappings go, I had a chat last Friday with Santhi, the curator at the GWAS Catalog responsible for these mappings, and she has confirmed that they expect to have the whole set ready by the end of this week, so I expect to have AZ evidence integrated for 22.04.

Their pace is frankly incredible. This is an overview of their process:

The string is first cleaned so that it does not contain references to ICD-10 codes.
Zooma is run on these cleaned strings; results of good and medium confidence are kept.
When a “direct” match is not possible, they extract the label of the ICD-10 term that appears in the raw phenotype and run Zooma on it.
If neither of the above works, the phenotype undergoes manual curation.
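For reference, the cascade above can be sketched roughly as follows. This is only my reading of their process, not their code: the ICD-10 regex, the `lookup` callable (a stand-in for a Zooma query, not the Zooma API), the `icd10_labels` dictionary and the toy IDs are all assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for ICD-10 codes such as "E11" or "I25.1"; the real
# cleaning rules used by the curators are not described in this thread.
ICD10_CODE = re.compile(r"\b[A-Z]\d{2}(?:\.\d{1,2})?\b")

# Stand-in for a Zooma query: takes a string and returns an ontology ID, or None
# when there is no good/medium confidence hit. This is NOT the Zooma API.
Lookup = Callable[[str], Optional[str]]


def clean_phenotype(raw: str) -> str:
    """Remove ICD-10 codes and collapse the leftover whitespace."""
    without_codes = ICD10_CODE.sub(" ", raw)
    return " ".join(without_codes.split())


def map_phenotype(
    raw: str, lookup: Lookup, icd10_labels: dict[str, str]
) -> tuple[Optional[str], str]:
    """Return (ontology_id, route); route records which step produced the mapping."""
    # Steps 1-2: direct match on the cleaned string.
    ontology_id = lookup(clean_phenotype(raw))
    if ontology_id is not None:
        return ontology_id, "direct"
    # Step 3: fall back to the label of the ICD-10 code found in the raw string.
    code = ICD10_CODE.search(raw)
    if code and code.group() in icd10_labels:
        ontology_id = lookup(icd10_labels[code.group()])
        if ontology_id is not None:
            return ontology_id, "icd10_label"
    # Step 4: nothing worked, so the phenotype goes to manual curation.
    return None, "manual_curation"


# Toy usage with a made-up index and a made-up ID.
toy_index = {"type 2 diabetes mellitus": "TOY_0000001"}
toy_lookup = lambda text: toy_index.get(text.lower())
print(map_phenotype("E11 Type 2 diabetes mellitus", toy_lookup, {}))
```

Returning the route alongside the ID makes it easy to count how many traits needed the ICD-10 fallback or manual curation.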

ireneisdoomed commented 2 years ago

GWAS Catalog has finished curating the AZ traits. They have provided us with their working Excel file so that we can integrate the mappings while waiting for them to be made available through their platform. Once that happens, the next step is to ingest them directly from their Downloads page, as we do with REGENERON.

Location of the file: gs://otar000-evidence_input/GeneBurden/data_files/AZ_Traits.xlsx

We have already generated evidence from them, see #121

ireneisdoomed commented 2 years ago

After a meeting with GWAS Catalog to discuss the gene burden mappings: we discussed some particular mappings that I thought were not accurate, and they will be updating them. As a reminder, we have 6k pieces of evidence from AZ that we are dropping due to unmapped disease, mostly metabolomics-related traits. We are missing these because they haven’t processed all the traits available in the PheWAS Portal, only the ones reported in the publication, which is based on a smaller sample. Therefore we will be able to recover them by 22.06 ^^

Regarding REGENERON, they told me that only European-based sumstats were submitted, so their studies only consider European individuals. They weren’t aware of analyses done on other populations. They were also unsure whether these would pass the filters of their sumstats pipeline, so they may not be included in the future. This means we have a bug: we link Regeneron to GCST by joining the data on the trait alone, not on trait and ancestry. As a result, 271 pieces of evidence of non-European ancestry link to a Catalog study for the same trait that was done with Europeans.
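A toy illustration of the join problem, in plain Python rather than our actual parser. The field names, ancestry codes, gene names and study IDs are all made up; only the join keys matter here.

```python
# Toy records; field names and values are hypothetical, not the real schema.
evidence = [
    {"trait": "Body mass index", "ancestry": "EUR", "gene": "GENE_A"},
    {"trait": "Body mass index", "ancestry": "SAS", "gene": "GENE_B"},
]
catalog_studies = [
    {"trait": "Body mass index", "ancestry": "EUR", "study_id": "GCST_TOY_1"},
]


def link_on_trait(evidence, studies):
    """Trait-only join: every ancestry inherits the study found for the trait."""
    by_trait = {s["trait"]: s["study_id"] for s in studies}
    return [{**e, "study_id": by_trait.get(e["trait"])} for e in evidence]


def link_on_trait_and_ancestry(evidence, studies):
    """Stricter join: a study is linked only when trait and ancestry both match."""
    by_key = {(s["trait"], s["ancestry"]): s["study_id"] for s in studies}
    return [
        {**e, "study_id": by_key.get((e["trait"], e["ancestry"]))} for e in evidence
    ]


loose = link_on_trait(evidence, catalog_studies)
strict = link_on_trait_and_ancestry(evidence, catalog_studies)

# The SAS row is wrongly tied to the European study by the trait-only join,
# and is left without a study ID by the stricter join.
print(loose[1]["study_id"], strict[1]["study_id"])
```

With the stricter key, the non-European evidence would simply carry no study ID, which seems more honest than pointing at a European study.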

ireneisdoomed commented 2 years ago

AZ's summary stats are about to be included in the GWAS Catalog - they are just waiting for some EFOs to be added.

Santhi has shared with us a spreadsheet where she found some inconsistencies between our mappings and their mappings. https://app.zenhub.com/files/143733948/0fe3ea85-aed3-41e4-bc3d-396438c8737c/download

In my opinion, these are few (20) and minor, so we shouldn't worry about them.

When this data is out we should:

1) Change the AZ parser so that we pull the mappings from the GWAS Catalog instead of our manual curation repository.
2) Remove the AZ traits from the manual curation table (my preference), although we can simply decide to keep them.
3) Make sure that the unmapped quantitative traits mentioned above are also included in the Catalog.

d0choa commented 1 year ago

@ireneisdoomed can we close this?

ireneisdoomed commented 1 year ago

I think so. I was waiting for the data to be available in the GWAS Catalog, but it hasn't happened.

opentargets / issues

Map phenotypes from collapsing analyses to EFO #1940