ramiromagno / gwasrapidd

gwasrapidd: an R package to query, download and wrangle GWAS Catalog data
https://rmagno.eu/gwasrapidd/

Link association_id, study_id and ancestry_id #10

Closed mightyphil2000 closed 2 years ago

mightyphil2000 commented 3 years ago

Hi there

Thanks for creating a lovely package!

Is there a way to retrieve associations searching on reported trait and then linking the associations to study_id and ancestry? This is what I do at the moment:

  1. get_studies(reported_trait = "colorectal cancer")
  2. Then I loop the get_associations() function over the study_ids retrieved from the first step.
  3. I'd now like to link the associations to their ancestry. I thought I'd be able to do that using study_id but this doesn't work because ancestry_id varies within study_id.

many thanks Philip

ramiromagno commented 3 years ago

Is this what you are looking for?


    library(gwasrapidd)
    library(purrr)
    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:gwasrapidd':
    #> 
    #>     intersect, n, setdiff, setequal, union
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union

    studies <- get_studies(reported_trait = "colorectal cancer")
    study_ids <- studies@studies$study_id
    names(study_ids) <- study_ids
    associations <-
      purrr::map(study_ids, ~ get_associations(study_id = .x))
    study2association <-
      purrr::imap_dfr(
        associations,
        ~ tibble::tibble(
          study_id = .y,
          association_id = .x@associations$association_id
        )
      )

    ancestries <-
      dplyr::left_join(studies@ancestries,
                       studies@ancestral_groups,
                       by = c('study_id', 'ancestry_id')) %>%
      dplyr::left_join(studies@countries_of_origin, by = c('study_id', 'ancestry_id')) %>%
      dplyr::rename(
        co_country_name = country_name,
        co_major_area = major_area,
        co_region = region
      ) %>%
      dplyr::left_join(studies@countries_of_recruitment,
                       by = c('study_id', 'ancestry_id')) %>%
      dplyr::rename(
        cr_country_name = country_name,
        cr_major_area = major_area,
        cr_region = region
      )

    (study_assoc_ancestry <-
        dplyr::left_join(study2association, ancestries, by = c('study_id')))
    #> # A tibble: 3,399 x 12
    #>    study_id association_id ancestry_id type  number_of_indiv… ancestral_group
    #>    <chr>    <chr>                <int> <chr>            <int> <chr>          
    #>  1 GCST000… 6063                     1 init…             1890 European       
    #>  2 GCST000… 6063                     2 repl…            12580 European       
    #>  3 GCST000… 11928                    1 init…             3831 European       
    #>  4 GCST000… 11928                    2 repl…            37210 European       
    #>  5 GCST000… 11928                    2 repl…            37210 European       
    #>  6 GCST000… 11928                    2 repl…            37210 European       
    #>  7 GCST000… 11925                    1 init…             3831 European       
    #>  8 GCST000… 11925                    2 repl…            37210 European       
    #>  9 GCST000… 11925                    2 repl…            37210 European       
    #> 10 GCST000… 11925                    2 repl…            37210 European       
    #> # … with 3,389 more rows, and 6 more variables: co_country_name <chr>,
    #> #   co_major_area <chr>, co_region <chr>, cr_country_name <chr>,
    #> #   cr_major_area <chr>, cr_region <chr>

Please note that study_id and association_id are GWAS Catalog identifiers; they are absolute, meaning each is unique across the whole database. The ancestry_id, by contrast, is a dummy counter used only to distinguish different ancestries within a study, so you need the pair study_id and ancestry_id together to identify an ancestry globally.
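To illustrate why the pair matters, here is a minimal sketch using small made-up tables (the study ids `S1`/`S2` and all values are hypothetical stand-ins, not GWAS Catalog data). It only assumes that ancestry-related tables share the `study_id` and `ancestry_id` columns, as in the code above.

```r
library(dplyr)
library(tibble)

# Hypothetical toy tables mimicking two ancestry-related slots.
ancestries <- tribble(
  ~study_id, ~ancestry_id, ~type,
  "S1",      1L,           "initial",
  "S1",      2L,           "replication",
  "S2",      1L,           "initial"
)
ancestral_groups <- tribble(
  ~study_id, ~ancestry_id, ~ancestral_group,
  "S1",      1L,           "European",
  "S1",      2L,           "East Asian",
  "S2",      1L,           "European"
)

# Joining on the pair keeps exactly one row per ancestry sample.
by_pair <- left_join(ancestries, ancestral_groups,
                     by = c("study_id", "ancestry_id"))

# Joining on ancestry_id alone mixes samples from different studies,
# because the counter restarts in every study. Recent dplyr versions
# warn about the resulting many-to-many match, hence suppressWarnings().
by_counter <- suppressWarnings(
  left_join(ancestries, ancestral_groups, by = "ancestry_id")
)

nrow(by_pair)
nrow(by_counter)
```

The second join returns more rows than there are ancestry samples, which is the symptom to watch for when a key is incomplete.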

mightyphil2000 commented 3 years ago

Thank you very much. I also want to include the SNP rsID and the genetic association results. I guess I just have to tweak this section of the code to get that right?

    associations <-
      purrr::map(study_ids, ~ get_associations(study_id = .x))
    study2association <-
      purrr::imap_dfr(
        associations,
        ~ tibble::tibble(
          study_id = .y,
          association_id = .x@associations$association_id
        )
      )
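One possible way to approach this, sketched under assumptions: I am assuming that the object returned by `get_associations()` has, besides the `associations` slot used above, a `risk_alleles` slot with `association_id`, `variant_id` (the rsID) and `risk_allele` columns, and that the `associations` table carries result columns such as `pvalue` and `beta_number`. Those names should be checked against the package documentation. The tables below are hypothetical stand-ins so the joins can be run without querying the Catalog.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the study2association table built above.
study2association <- tribble(
  ~study_id, ~association_id,
  "S1",      "A1",
  "S1",      "A2"
)

# Hypothetical stand-in for x@associations (assumed column names).
assoc_tbl <- tribble(
  ~association_id, ~pvalue, ~beta_number,
  "A1",            1e-9,    0.12,
  "A2",            3e-8,    NA_real_
)

# Hypothetical stand-in for x@risk_alleles (assumed slot and column names).
risk_alleles_tbl <- tribble(
  ~association_id, ~variant_id, ~risk_allele,
  "A1",            "rs0000001", "T",
  "A2",            "rs0000002", "G",
  "A2",            "rs0000003", "C"
)

# association_id is unique across the database, so it is a sufficient key.
# An association can involve several variants, so rows may be repeated.
assoc_with_variants <- study2association %>%
  left_join(assoc_tbl, by = "association_id") %>%
  left_join(risk_alleles_tbl, by = "association_id")

assoc_with_variants
```

With real data, the two stand-in tables would be replaced by the corresponding slots of each element of `associations`, bound together across studies.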

ramiromagno commented 3 years ago

I am closing this due to inactivity.

peranti commented 2 years ago

Hi @ramiromagno,

What does it mean when the same study_id and ancestral_group have a different ancestry_id?

ige_studies@ancestral_groups %>% filter(ancestral_group == "European")

## A tibble: 7 × 3
#   study_id   ancestry_id ancestral_group
#   <chr>            <int> <chr>          
# 1 GCST000222           1 European       
# 2 GCST000222           2 European      

ramiromagno commented 2 years ago

Hi @peranti,

The combination study_id and ancestry_id uniquely identifies ancestry samples.

For that specific study, GCST000222, there are two ancestries. The GWAS Catalog does not assign them any special identifiers. In gwasrapidd, however, I assign a dummy identifier, ancestry_id, to distinguish them and to allow linkage between the tables that contain details about ancestries.

So the table in your example, ancestral_groups, is just that: the ancestral groups for the two ancestries used in this study, which happen to be the same (European). However, if you look at the ancestries table, you can see that they differ in the stage of the ancestry sample (initial or replication) and in the number of individuals in each sample.

In this case there is only one ancestral group (European), and all individuals were recruited in Germany, so you might think there should be only one ancestry. But there are two independent groups of individuals, used in different stages of the study, so we need to distinguish them, and ancestry_id serves that purpose. For study_id=GCST000222, the ancestry-related tables ancestral_groups and countries_of_recruitment hold the same values for ancestry_id=1 and ancestry_id=2 (countries_of_origin is empty); only ancestries shows that there are two ancestry samples, each associated with a different stage and size.

I hope it is clear now.

library(gwasrapidd)

my_studies <- get_studies(study_id = 'GCST000222')
my_studies
#> An object of class "studies"
#> Slot "studies":
#> # A tibble: 1 × 13
#>   study_id   reported_trait initial_sample_size  replication_sample… gxe   gxg  
#>   <chr>      <chr>          <chr>                <chr>               <lgl> <lgl>
#> 1 GCST000222 IgE levels     1,530 European ance… 9,769 European anc… FALSE FALSE
#> # … with 7 more variables: snp_count <int>, qualifier <chr>, imputed <lgl>,
#> #   pooled <lgl>, study_design_comment <chr>, full_pvalue_set <lgl>,
#> #   user_requested <lgl>
#> 
#> Slot "genotyping_techs":
#> # A tibble: 1 × 2
#>   study_id   genotyping_technology       
#>   <chr>      <chr>                       
#> 1 GCST000222 Genome-wide genotyping array
#> 
#> Slot "platforms":
#> # A tibble: 1 × 2
#>   study_id   manufacturer
#>   <chr>      <chr>       
#> 1 GCST000222 Affymetrix  
#> 
#> Slot "ancestries":
#> # A tibble: 2 × 4
#>   study_id   ancestry_id type        number_of_individuals
#>   <chr>            <int> <chr>                       <int>
#> 1 GCST000222           1 initial                      1530
#> 2 GCST000222           2 replication                  9769
#> 
#> Slot "ancestral_groups":
#> # A tibble: 2 × 3
#>   study_id   ancestry_id ancestral_group
#>   <chr>            <int> <chr>          
#> 1 GCST000222           1 European       
#> 2 GCST000222           2 European       
#> 
#> Slot "countries_of_origin":
#> # A tibble: 0 × 5
#> # … with 5 variables: study_id <chr>, ancestry_id <int>, country_name <chr>,
#> #   major_area <chr>, region <chr>
#> 
#> Slot "countries_of_recruitment":
#> # A tibble: 2 × 5
#>   study_id   ancestry_id country_name major_area region        
#>   <chr>            <int> <chr>        <chr>      <chr>         
#> 1 GCST000222           1 Germany      Europe     Western Europe
#> 2 GCST000222           2 Germany      Europe     Western Europe
#> 
#> Slot "publications":
#> # A tibble: 1 × 7
#>   study_id   pubmed_id publication_date publication title        author_fullname
#>   <chr>          <int> <date>           <chr>       <chr>        <chr>          
#> 1 GCST000222  18846228 2008-08-22       PLoS Genet  Genome-wide… Weidinger S    
#> # … with 1 more variable: author_orcid <chr>
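
The point can also be checked programmatically. The sketch below retypes the `ancestries` and `ancestral_groups` slots printed above as plain tibbles (so it runs without querying the Catalog) and joins them on both keys; with a live object one would use `my_studies@ancestries` and `my_studies@ancestral_groups` instead.

```r
library(dplyr)
library(tibble)

# Values copied from the printed slots of GCST000222 above.
ancestries <- tribble(
  ~study_id,    ~ancestry_id, ~type,         ~number_of_individuals,
  "GCST000222", 1L,           "initial",     1530L,
  "GCST000222", 2L,           "replication", 9769L
)
ancestral_groups <- tribble(
  ~study_id,    ~ancestry_id, ~ancestral_group,
  "GCST000222", 1L,           "European",
  "GCST000222", 2L,           "European"
)

# One row per ancestry sample, keyed by (study_id, ancestry_id).
samples <- inner_join(ancestries, ancestral_groups,
                      by = c("study_id", "ancestry_id"))

# Same ancestral group, but distinct stages and sample sizes.
n_distinct(samples$ancestral_group)
n_distinct(samples$type)
samples
```

The joined table keeps two rows: the ancestral group is shared, while `type` and `number_of_individuals` differ, which is what `ancestry_id` is there to separate.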

ramiromagno commented 2 years ago

Hi @peranti:

Did this clarification help?

ramiromagno commented 2 years ago

I am closing this due to inactivity.