Closed totajuliusd closed 2 years ago
Hi Tota,
Thank you for your question.
Your code seems fine overall.
Let me just present a tidier approach that:
dplyr::left_join()
to join tables (better than assuming that the number of rows across tables will match by position, in general they will not, although in your case it seems to work)genes
table in the associations
object, not from the genomic_contexts
table from the variants
object. Results should be similar, but the genes
table reflects the genes to be annotated with that variant according to the authors of the study, whereas the genes listed genomic_contexts
are generated by the GWAS Catalog team by automatic workflows. So, why am I getting the genes from the genes
table? Just to show you that you could do differently. Stick with your approach if you prefer the annotation by the GWAS Catalog team instead of the original authors'.With regards to efficiency, both methods are equivalent. The bottleneck is always the retrieval of the data from the GWAS Catalog. In your case we need to get associations and variants. The rest is just wrangling.
Feel free to ask more questions! Happy coding.
library(gwasrapidd)
study_id <- "GCST004132"
associations <- get_associations(study_id = study_id)
variants <- get_variants(study_id = study_id)
# Because there are more than on gene associated with a locus
genes <- associations@genes %>%
dplyr::group_by(association_id, locus_id) %>%
dplyr::summarise(gene_name = paste(gene_name, collapse = ' '), .groups = 'drop')
association_results <-
associations@associations %>%
dplyr::select(association_id, pvalue, beta_number, or_per_copy_number) %>%
dplyr::left_join(associations@risk_alleles, by = 'association_id') %>%
dplyr::left_join(genes, by = c('association_id', 'locus_id')) %>%
dplyr::left_join(variants@variants, by = c('variant_id')) %>%
dplyr::transmute(
study_id = study_id,
association_id = association_id,
ID = variant_id,
CHROM = chromosome_name,
POS = chromosome_position,
risk_allele = risk_allele,
gene_name = gene_name,
P = pvalue,
beta = beta_number,
OR = or_per_copy_number
)
association_results
#> # A tibble: 119 × 10
#> study_id association_id ID CHROM POS risk_allele gene_name P beta
#> <chr> <chr> <chr> <chr> <int> <chr> <chr> <dbl> <dbl>
#> 1 GCST0041… 19144286 rs34… 1 1.60e8 G SLAMF8 1e- 6 NA
#> 2 GCST0041… 19144332 rs25… 3 5.31e7 C intergen… 6e- 9 NA
#> 3 GCST0041… 19144360 rs56… 3 1.89e8 C LPP 6e-10 NA
#> 4 GCST0041… 19144385 rs80… 13 4.23e7 C AKAP11 4e- 8 NA
#> 5 GCST0041… 19144411 rs48… 22 3.69e7 C NCF4 2e- 8 NA
#> 6 GCST0041… 19144456 rs10… 16 8.28e7 A CDH13 1e- 9 NA
#> 7 GCST0041… 19713859 rs14… 7 5.03e7 <NA> C7orf72 … 9e-12 NA
#> 8 GCST0041… 19713864 rs12… 17 4.24e7 <NA> NAGLU ST… 2e-11 NA
#> 9 GCST0041… 19713869 rs51… 19 4.87e7 <NA> IZUMO1 N… 4e-11 NA
#> 10 GCST0041… 19713874 rs10… 2 4.36e7 <NA> THADA ZF… 4e-11 NA
#> # … with 109 more rows, and 1 more variable: OR <dbl>
Thank you very much, when I said more efficient I meant tidier, so this is perfect and exactly what I wanted!
Hi, can you please tell me what would be the most efficient way to extract the following information for a particular study by study_id?
What I would like to end up with, would be a table with the following columns:
CHROM, POS, P, beta_or_OR,risk_allle (or REF and ALT),gene_name
I can do this using the following code. But I havent used S4 objects before, and am sure there must be a simpler way of making this table by using them correctly?
study_id <- "GCST004132" associations <- get_associations(study_id=study_id) variants <- get_variants(study_id = study_id)
assoc_df <- data.frame(P = as.numeric(associations@associations$pvalue), ID = associations@risk_alleles$variant_id, beta=associations@associations$beta_number,OR=associations@associations$or_per_copy_number, risk_allele=associations@risk_alleles$risk_allele) variants_df <- data.frame(ID=variants@variants$variant_id, CHROM=variants@variants$chromosome_name, POS=variants@variants$chromosome_position)
variants_assoc_df <- merge(assoc_df, variants_df, by="ID") gene_name_df <- variants@genomic_contexts %>% distinct(variant_id, .keep_all=T) %>% select(variant_id,gene_name) %>%rename(ID=variant_id) %>% as.data.frame()
variants_with_gene_name <- inner_join(variants_assoc_df, gene_name_df, by="ID")
I appreciate any help with this, Tota