kauwelab / PolyRiskScore

PRSKB is a website and command-line interface tool for calculating polygenic risk scores using GWA studies from the NHGRI-EBI Catalog.

tables/study_table.tsv #414

Closed erskck1 closed 1 year ago

erskck1 commented 1 year ago

Hi, Thank you for developing this great tool!

My question is:
There is a file named study_traits.tsv in the tables directory. Does it list the studies for which the program can calculate scores? When filtering traits and studies, can we use this file, or should we use the tables published by the GWAS Catalog? (https://www.ebi.ac.uk/gwas/docs/file-downloads)

Thanks and best wishes, Ersoy

MattCloward commented 1 year ago

Hello Ersoy and thank you for your interest in PRSKB!

If I understand your questions correctly, you are looking for the best way to find the traits and studies you want to use in your PRS calculations. Is that right? The study_traits.tsv is a filtered version of the GWAS Catalog data and indeed contains most of the studies PRSKB can use to calculate scores. However, you can use our studies page to find traits and studies to filter on more efficiently. Additionally, you can browse available traits and studies on the calculate page using the traits drop-down under "select trait(s) of interest."

Let me know if that answers your questions!

erskck1 commented 1 year ago

Hi Matthew,

Thank you for answering my question so precisely. Yes, that is exactly what I meant. The problem is that I cannot upload the data to your system to calculate PRS, because the data are not public and contain sensitive content, so I want to use the CLI version of PRSKB. But I am having some trouble selecting traits and studies.

I need to select traits/studies as comprehensively as possible according to the following criteria:

If I filter the GWAS Catalog's tables by these criteria, I find over 1500 mapped traits, but if I filter study_tables.tsv by the same criteria, I get only about 300 traits. I wanted to know whether PRSKB is in sync with the GWAS Catalog, and whether there is a way to access the PRSKB database. I could not fully work out what strategy I should follow for trait selection. As far as I understand, it is not possible to calculate PRS for every study in the GWAS Catalog using PRSKB, because the summary statistics are missing for some studies. The calculate page makes selection very easy, but I can neither upload my data there nor extract the query from the page.

The study_traits.tsv is a filtered version of the GWAS Catalog data

What criteria did you use to filter?

I hope I have correctly described my problem.

Thanks and best wishes, Ersoy

erskck1 commented 1 year ago

Hi again,

In the script update_database_scripts/createStudyTable.R, I have found the answers to many of my questions. But I still don't understand how I should interpret the lastUpdated column, because the publication date and lastUpdated are different dates.

Thanks, Ersoy

MattCloward commented 1 year ago

Thank you for your questions!

PRSKB is not currently synced with the GWAS catalog and we are working on it. The last sync was in March of this year.

Many of the studies, even from that month, have been filtered out as described in our paper and in the script you found. Here is a summary from our paper:

"The data are filtered to include only associations that contain both a beta value (or odds ratio) and the respective risk allele. Each variant is analyzed independently (i.e., risk haplotypes are excluded). Sex-specific variants are not included in the database."

Any study that didn't have at least one association after these filtering criteria was removed, meaning all of the filtered studies match your criterion: "association count > 0"
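
For readers who want to reproduce this kind of filtering on their own copy of the GWAS Catalog data, the criteria quoted above can be sketched roughly as follows. This is only an illustration, not the logic of createStudyTable.R: the field names (beta, oddsRatio, riskAllele, isHaplotype, sexSpecific) and the example rows are hypothetical, and the real tables may name and encode these fields differently.

```python
def has_value(value):
    """Treat None, empty strings and 'NA' as missing."""
    return value is not None and str(value).strip() not in ("", "NA")


def keep_association(row):
    """Return True if a row passes the filters quoted above (hypothetical field names)."""
    has_effect = has_value(row.get("beta")) or has_value(row.get("oddsRatio"))
    has_risk_allele = has_value(row.get("riskAllele"))
    return (
        has_effect
        and has_risk_allele
        and not row.get("isHaplotype", False)   # risk haplotypes are excluded
        and not row.get("sexSpecific", False)   # sex-specific variants are excluded
    )


def surviving_studies(rows):
    """Study IDs that keep at least one association after filtering."""
    return {row["studyID"] for row in rows if keep_association(row)}


# Made-up rows, for illustration only.
example_rows = [
    {"studyID": "STUDY_A", "beta": "0.12", "riskAllele": "A"},
    {"studyID": "STUDY_B", "oddsRatio": "1.3", "riskAllele": ""},      # no risk allele
    {"studyID": "STUDY_C", "beta": "NA", "oddsRatio": None, "riskAllele": "T"},  # no effect size
    {"studyID": "STUDY_D", "beta": "0.4", "riskAllele": "G", "sexSpecific": True},
]

kept = surviving_studies(example_rows)
```

In this sketch only STUDY_A keeps an association, so it is the only study that would remain in the table.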

lastUpdated is the date of the most recent change to the study's associations, which is frequently much later than the study's publication date. We currently don't have a way to filter by publication date, but I will submit that as a feature request to the team.

erskck1 commented 1 year ago

Hi Matthew,

Thank you for your reply!

I decided to use study_table.tsv for filtering traits, and I was able to run the scripts that create study_table.tsv. You said that:

Any study that didn't have at least one association after these filtering criteria was removed, meaning all of the filtered studies match your criterion: "association count > 0"

But in study_table.tsv there are still some studies with an association count of 0. Could this be because you filtered out NA values instead of zeros in createStudyTable.R (line 230)? And why are some of the association counts negative?

MattCloward commented 1 year ago

It sounds to me like you are looking at the "numAssociationsFiltered" column. That column shows the number of associations from the original study that were filtered out and are NOT in our table. It is currently broken and I am trying to fix it. What you are looking for, however, is the number of associations IN the table. We don't have a column for that yet, but I am working on adding it now. In the meantime, you can use associations_table.tsv to see the number of associations for every studyID. I hope that helps!
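
A minimal sketch of that workaround, assuming associations_table.tsv is tab-separated, has a header row, has one row per association, and has a column literally named studyID (none of which has been checked against the actual file; the sample rows and the snp column are made up):

```python
import csv
import io
from collections import Counter


def count_associations(tsv_file):
    """Count rows per studyID in an open tab-separated file with a header row."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(row["studyID"] for row in reader)


# In-memory stand-in for the real file, for illustration only.
sample = io.StringIO(
    "studyID\tsnp\n"
    "STUDY_A\trs1\n"
    "STUDY_A\trs2\n"
    "STUDY_B\trs3\n"
)
counts = count_associations(sample)
```

For the real file, the same function could be called on a handle opened with open(path, newline=""), where path points to your local copy of associations_table.tsv. Because a Counter returns 0 for keys it has never seen, any studyID from study_table.tsv that is absent from the associations table shows up with a count of 0.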

erskck1 commented 1 year ago

Perfect, that helped me a lot, we can close this issue. Thank you very much!