ramiromagno / gwasrapidd

gwasrapidd: an R package to query, download and wrangle GWAS Catalog data
https://rmagno.eu/gwasrapidd/
Other
93 stars 14 forks source link

discrepancies between get_variant and get_association #20

Closed mightyphil2000 closed 3 years ago

mightyphil2000 commented 3 years ago

I was expecting these two different search strategies to identify the same variant IDs but they don't. Do you know why they perform differently?

trait="Plasma omega-6 polyunsaturated fatty acid levels (arachidonic acid)"
efo="arachidonic acid measurement"
efo_id=NULL

Strategy 1. identify variant IDs using combination of get_studies and get_associations

gwas_studies<-gwasrapidd::get_studies(efo_trait = efo,efo_id=efo_id,reported_trait=trait)           
gwas_associations<-gwasrapidd::get_associations(study_id = gwas_studies@studies$study_id)

> gwas_associations@risk_alleles$variant_id
 [1] "rs2581624"   "rs174545"    "rs4246215"   "rs174601"    "rs8523"     
 [6] "rs174549"    "rs174541"    "rs174549"    "rs174550"    "rs12580543" 
[11] "rs3811444"   "rs174535"    "rs2110073"   "kgp12662626" "rs7258177"  
[16] "rs174528"    "rs174577"    "kgp12035941" "rs1795851"   "rs1404384"  
[21] "rs16979306"  "rs6133127"   "rs1688589"   "rs1539053"   "kgp9433132" 
[26] "rs1882496"   "kgp6577813"  "kgp2178364"  "rs12714668"  "rs1080261"  
[31] "rs9942436"   "rs2637523"   "rs10839732"  "rs12471016"  "rs16829840" 
[36] "rs274557"    "rs9394931"   "rs12209128"  "rs174547"    "rs1741"     
[41] "rs102275"   

Strategy 2. identify variant IDs using get_variants

gwas_variants<-gwasrapidd::get_variants(efo_trait = efo,efo_id=efo_id,reported_trait=trait)     
> gwas_variants@variants$variant_id
 [1] "rs1080261"   "rs7258177"   "rs12714668"  "rs16979306"  "rs174528"   
 [6] "rs9942436"   "kgp2178364"  "rs174577"    "kgp12035941" "kgp6577813" 
[11] "rs1795851"   "rs2637523"   "rs1882496"   "rs1539053"   "rs174545"   
[16] "rs1404384"   "rs6133127"   "kgp12662626" "kgp9433132"  "rs1688589"  
[21] "rs10839732"  "rs2581624"   "rs274557"    "rs1741"      "rs12471016" 
[26] "rs174547"    "rs12209128"  "rs9394931"   "rs102275"    "rs16829840" 

Why are the two sets of variants different?

ramiromagno commented 3 years ago

Hi @mightyphil2000:

The example is not complete because the definition of the variable efo_id is missing. Could you provide that? May I suggest you use the reprex function to generate the minimal working example?

Furthermore, by searching by all those criteria simultaneously, do you intend to retrieve matches that fulfill all criteria, or that fulfill (at least) one of the criteria?

mightyphil2000 commented 3 years ago

Sorry about. I've updated the original post with efo_id set to NULL (which is how I've been setting it). I'm looking for matches that fulfill at least one of the criteria.

ramiromagno commented 3 years ago

Okay, thank you. I will take a look at it tomorrow.

ramiromagno commented 3 years ago

Hi @mightyphil2000:

I just looked into your example, and I do not have a good explanation for why we are seeing these differences; we should be getting the same variants either way. As far as I could inspect into gwasrapidd code, there seems to be no problem with gwasrapidd itself. So I forwarded your example to the GWAS Catalog team hoping to get some clarification on this issue. I will let you know as soon as I have a reply (likely this Monday).

ramiromagno commented 3 years ago

Hi @mightyphil2000,

I got a reply from Elliot Sollis (with the GWAS Catalog team).

Short answer

Use gwasrapidd::get_variants() function.

Explanation

Like you, I too had the expectation that both approaches would retrieve the same results. But apparently there are historical reasons that explain these differences for a subset of the older data in the Catalog.

Here is a post of the emails we exchanged (with permission from Elliot):

My question

Hi GWAS Catalog team,

I am getting a different number of variants depending on the query route I make using the REST API.

My query is based on the reported trait, i.e., "arachidonic acid measurement". I tried two approaches, here dubbed approach 1 and 2.

Approach 1: Search for studies by "arachidonic acid measurement" (which results in GCST002712 and GCST003236), then search for the respective associations, and from those associations retrieve the variants.

Approach 2: Search for variants directly by "arachidonic acid measurement".

I find that all variants obtained with approach 2 are included in those obtained with approach 1. But there is a set of variants that are exclusive to approach 1. Why? Is this a bug?

Here are the variants that are exclusive to approach 1:

# A tibble: 11 x 7
   association_id locus_id variant_id risk_allele risk_frequency genome_wide limited_list
   <chr>             <int> <chr>      <chr>                <dbl> <lgl>       <lgl>      
 1 10079997              1 rs4246215  NA                    0.36 FALSE       FALSE      
 2 10079998              1 rs174601   NA                    0.35 FALSE       FALSE      
 3 10079999              1 rs8523     A                     0.44 FALSE       FALSE      
 4 10080000              1 rs174549   NA                    0.29 FALSE       FALSE      
 5 10080001              1 rs174541   NA                    0.36 FALSE       FALSE      
 6 10080002              1 rs174549   NA                    0.29 FALSE       FALSE      
 7 10080003              1 rs174550   NA                    0.33 FALSE       FALSE      
 8 10080004              1 rs12580543 C                     0.06 FALSE       FALSE      
 9 10080005              1 rs3811444  T                     0.32 FALSE       FALSE      
10 10080006              1 rs174535   NA                    0.33 FALSE       FALSE      
11 10080007              1 rs2110073  T                     0.09 FALSE       FALSE      

Elliot's reply

Hi Ramiro,

I think I’ve figured this one out.

To explain it, I think I just need to confirm some of the terminology to make sure we’re on the same page: sorry if it seems pedantic! The reported trait is a free-text trait description that we annotate independently for each study. The term you used in your query, “arachidonic acid measurement” is what we just call the trait, which is the more standardised ontology term that is more comparable across different studies.

What’s relevant here is that for some older publications that included multiple GWAS analyses (e.g. 25500335 from 2014), technical limitations meant that we used to keep all of the analyses together in a single study entry (GCST002712) with a broad reported trait (in this case “Red blood cell fatty acid levels”), and put multiple ontology terms for each of the traits analysed in the trait field (in this case arachidonic acid measurement, fatty acid measurement, linolenic acid measurement, oleic acid measurement, docosapentaenoic acid measurement, linoleic acid measurement). So if you search for any of those trait terms, this study will come up. But be aware that not all variants in the study are associated with every one of those terms.

That’s at the level of the study, but at the level of the association we were able to be more specific, so each association is only annotated with ontology terms for the specific analysis that association came from, e.g. in that same study rs2581624-C has [arachidonic acid measurement, fatty acid measurement] but rs4246215-? has [fatty acid measurement, linoleic acid measurement]. That means that if you search by “arachidonic acid measurement” you will get rs2581624 but not rs4246215.

We now try to split every publication into separate studies for every GWAS analysis, but I’m not sure if we’ll necessarily be able to adjust all of the older studies. So I would suggest that Approach 2 is a better option if you want to make sure you only capture direct associations with a particular trait.

Hope that helps!

Best wishes,

Elliot Sollis, PhD Scientific Curator, GWAS Catalog European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus Hinxton Cambridge, CB10 1SD

mightyphil2000 commented 3 years ago

Thanks @ramiromagno. I see now. A problem with the get_variants function is that you can't search on reported trait (I guess because the older studies are not sufficiently annotated at the association level?). I am concerned that by only searching on EFO I might miss some associations. This could happen when the EFO is less specific than the reported trait. For example, rs4770891-? is associated with the reported trait "Circulating odd-numbered chain saturated fatty acid levels (C19:0)" but with EFO "fatty acid measurement" in study GCST005862. If I was specifically interested in the saturated fatty acid C19:0, then searching on EFO would not identify this association, whereas searching for studies on reported trait would. In other words, you could argue that search strategy 1 (search for studies by reported trait then retrieve associations) is more sensitive (but suffers more false positives), while search strategy 2 (search for variant-EFO associations) is more specific (but suffers more false negatives). What do you think?

ramiromagno commented 3 years ago

Hi @mightyphil2000:

I have divided this answer into three points.

Point 1

A problem with the get_variants function is that you can’t search on reported trait (I guess because the older studies are not sufficiently annotated at the association level?). I am concerned that by only searching on EFO I might miss some associations.

I am a bit confused now. You say you cannot search on report_trait with the get_variants() function, but you can, right?

my_variants_by_rep_trait <- get_variants(reported_trait = "Circulating odd-numbered chain saturated fatty acid levels (C19:0)")
my_variants_by_efo_trait <- get_variants(efo_trait = "fatty acid measurement")

my_variants_by_rep_trait@variants
## # A tibble: 15 x 7
##    variant_id merged functional_class    chromosome_name chromosome_position
##    <chr>       <int> <chr>               <chr>                         <int>
##  1 rs1468510       0 3_prime_UTR_variant 7                         151360599
##  2 rs11842790      0 intron_variant      13                         50350312
##  3 rs13226693      0 intron_variant      7                         151353528
##  4 rs17074145      0 intron_variant      13                         50355253
##  5 rs17363566      0 intron_variant      13                         50360756
##  6 rs706603        0 intron_variant      13                         50280887
##  7 rs12871645      0 intron_variant      13                         50357429
##  8 rs12703118      0 intron_variant      7                         151345197
##  9 rs12853498      0 intron_variant      13                         50339738
## 10 rs17074143      0 intron_variant      13                         50348637
## 11 rs13221923      0 intron_variant      7                         151357494
## 12 rs17074093      0 intron_variant      13                         50314567
## 13 rs12874827      0 intron_variant      13                         50326600
## 14 rs12874278      0 intron_variant      13                         50361786
## 15 rs4770891       0 intron_variant      13                         25983588
## # … with 2 more variables: chromosome_region <chr>, last_update_date <dttm>

my_variants_by_efo_trait@variants
## # A tibble: 156 x 7
##    variant_id merged functional_class          chromosome_name chromosome_posit…
##    <chr>       <int> <chr>                     <chr>                       <int>
##  1 rs1604805       0 intron_variant            4                        21569464
##  2 rs7198595       0 intron_variant            16                       12471050
##  3 rs12098564      0 non_coding_transcript_ex… 10                       85193571
##  4 rs11842790      0 intron_variant            13                       50350312
##  5 rs788076        0 intergenic_variant        10                       29047920
##  6 rs742614        0 intergenic_variant        20                       33894826
##  7 rs10518201      0 regulatory_region_variant 4                        78730706
##  8 rs12871645      0 intron_variant            13                       50357429
##  9 rs7534537       0 intron_variant            1                       204305391
## 10 rs2118674       0 intron_variant            2                       170462384
## # … with 146 more rows, and 2 more variables: chromosome_region <chr>,
## #   last_update_date <dttm>

Point 2

This could happen when the EFO is less specific than the reported trait. For example, rs4770891-? is associated with the reported trait “Circulating odd-numbered chain saturated fatty acid levels (C19:0)” but with EFO “fatty acid measurement” in study GCST005862.

I did not quite understand your “but” clause. Regardless, note that the variant you mentioned, i.e., rs4770891 is present in both sets of variants above:

'rs4770891' %in% my_variants_by_rep_trait@variants$variant_id
## [1] TRUE

'rs4770891' %in% my_variants_by_efo_trait@variants$variant_id
## [1] TRUE

Point 3

If I was specifically interested in the saturated fatty acid C19:0, then searching on EFO would not identify this association, whereas searching for studies on reported trait would. In other words, you could argue that search strategy 1 (search for studies by reported trait then retrieve associations) is more sensitive (but suffers more false positives), while search strategy 2 (search for variant-EFO associations) is more specific (but suffers more false negatives). What do you think?

I could not follow your reasoning because I did not observe the differences (as shown above) you are referring to for this specific example… So perhaps you could illustrate your concerns with some code?

mightyphil2000 commented 3 years ago

Sorry I confused the get_variant with get_association function. The get_association function does not allow one to search on reported trait. I should have said "and" instead of "but" in point 2. I'll try to put some code together to illustrate what I mean in point 3. Thanks!

mightyphil2000 commented 3 years ago

Actually I think this specific issue is now resolved. I can see that in this particular example get_variants is the best option. The new problem I was trying to explain is actually to do with the get_associations function. I'm therefore going to post this issue on the get associations issue page I opened the other day.

ramiromagno commented 3 years ago

Okay, I will close this issue then.