ramiromagno / gwasrapidd

gwasrapidd: an R package to query, download and wrangle GWAS Catalog data
https://rmagno.eu/gwasrapidd/

Why are some gene names present in `genomic_contexts` but not in `ensembl_ids`? #42

Closed mzzclb closed 1 year ago

mzzclb commented 1 year ago

Hi, first of all, thank you for such a great package.

I have been retrieving gene data for certain variants with the gwasrapidd package, and I noticed that a variant can have inconsistent gene data in its ensembl_ids and genomic_contexts tables. For example, if I retrieve a variant with the get_variants function, some of its gene names in the ensembl_ids table may differ from those in the genomic_contexts table.

What could be the reason for this difference?

What is the difference between a variant's genomic_contexts and ensembl_ids tables as far as genes are concerned?

Unfortunately, I cannot reach the GWAS Catalog through the gwasrapidd package today: the functions return zero results, so I cannot attach any example files.

ramiromagno commented 1 year ago

Hi @mzzclb

Thank you for reaching out.

Because I am also having trouble retrieving data from the GWAS Catalog I can't check the issue you are reporting.

For the moment, check whether your problem might be related to this question: https://rmagno.eu/gwasrapidd/articles/faq.html#genomic-coordinates-of-genomic-contexts-seem-to-be-wrong.

Meanwhile I will check with the GWAS Catalog team why the server is not responding.

mzzclb commented 1 year ago

Thank you for replying.

What I described is not really related to the topic at that link.

I mean that a variant can have different sets of genes in its genomic_contexts and ensembl_ids tables. Could you look at the PDF file I attached as an example? I created it with R Markdown: ensembl_ids-and-genomic_context-of-a-variant.pdf

ramiromagno commented 1 year ago

Hi @mzzclb

The GWAS Catalog is running well again, so perhaps you could provide a specific example illustrating your question. I will try to answer nevertheless based on what you wrote.

The genomic_contexts table provides all Ensembl and RefSeq genes mapping within 50kb upstream and downstream of each GWAS Catalog variant.

Then, a specific gene is typically associated with only one Ensembl identifier, but there are cases where it is associated with more than one, e.g. a gene located in the haplotypic MHC region (see discussion here). The ensembl_ids table provides that information.

Here is an example:

library(gwasrapidd)

my_variants <- get_variants(variant_id = "rs2269423")

print(my_variants@genomic_contexts, n = 20)
#> # A tibble: 200 × 12
#>    variant_id gene_name    chromosome_name chromosome_position distance
#>    <chr>      <chr>        <chr>                         <int>    <int>
#>  1 rs2269423  FKBPL        6                          32177930    47642
#>  2 rs2269423  PPT2         6                          32177930    14252
#>  3 rs2269423  TNXB         6                          32177930    68592
#>  4 rs2269423  NOTCH4       6                          32177930    16913
#>  5 rs2269423  RNA5SP206    6                          32177930    99302
#>  6 rs2269423  RNA5SP206    6                          32177930    99302
#>  7 rs2269423  TSBP1-AS1    6                          32177930    76710
#>  8 rs2269423  PPT2-EGFL8   6                          32177930     5952
#>  9 rs2269423  FKBPL        6                          32177930    47642
#> 10 rs2269423  GPSM3        6                          32177930    12836
#> 11 rs2269423  PBX2         6                          32177930     6803
#> 12 rs2269423  MIR6721      6                          32177930     7814
#> 13 rs2269423  ATF6B        6                          32177930    49677
#> 14 rs2269423  EGFL8        6                          32177930     9649
#> 15 rs2269423  NOTCH4       6                          32177930    16913
#> 16 rs2269423  LOC100507547 6                          32177930    23565
#> 17 rs2269423  TNXB         6                          32177930    62596
#> 18 rs2269423  AGPAT1       6                          32177930        0
#> 19 rs2269423  MIR6833      6                          32177930     1886
#> 20 rs2269423  PPT2         6                          32177930    14255
#> # ℹ 180 more rows
#> # ℹ 7 more variables: is_mapped_gene <lgl>, is_closest_gene <lgl>,
#> #   is_intergenic <lgl>, is_upstream <lgl>, is_downstream <lgl>, source <chr>,
#> #   mapping_method <chr>
print(my_variants@ensembl_ids, n = 20)
#> # A tibble: 77 × 3
#>    variant_id gene_name ensembl_id     
#>    <chr>      <chr>     <chr>          
#>  1 rs2269423  FKBPL     ENSG00000224200
#>  2 rs2269423  FKBPL     ENSG00000204315
#>  3 rs2269423  FKBPL     ENSG00000223666
#>  4 rs2269423  FKBPL     ENSG00000230907
#>  5 rs2269423  PPT2      ENSG00000228116
#>  6 rs2269423  PPT2      ENSG00000206329
#>  7 rs2269423  PPT2      ENSG00000168452
#>  8 rs2269423  PPT2      ENSG00000206256
#>  9 rs2269423  PPT2      ENSG00000236649
#> 10 rs2269423  PPT2      ENSG00000221988
#> 11 rs2269423  PPT2      ENSG00000231618
#> 12 rs2269423  TNXB      ENSG00000168477
#> 13 rs2269423  TNXB      ENSG00000236236
#> 14 rs2269423  TNXB      ENSG00000206258
#> 15 rs2269423  TNXB      ENSG00000229353
#> 16 rs2269423  TNXB      ENSG00000233323
#> 17 rs2269423  TNXB      ENSG00000231608
#> 18 rs2269423  NOTCH4    ENSG00000235396
#> 19 rs2269423  NOTCH4    ENSG00000223355
#> 20 rs2269423  NOTCH4    ENSG00000204301
#> # ℹ 57 more rows

Created on 2023-07-04 with reprex v2.0.2
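
To make the one-gene-to-many-identifiers relationship easier to see, one can count the Ensembl identifiers per gene name. The following is only a minimal base-R sketch: it uses a hand-copied excerpt of the ensembl_ids output above (the FKBPL and NOTCH4 rows among the first 20) instead of a live query, so the counts reflect that excerpt and not necessarily the full 77-row table. The object names ensembl_ids_excerpt and ids_per_gene are made up for this illustration.

```r
# Excerpt copied from the first rows of my_variants@ensembl_ids shown above.
# With a live query you would use my_variants@ensembl_ids directly.
ensembl_ids_excerpt <- data.frame(
  variant_id = "rs2269423",
  gene_name  = c("FKBPL", "FKBPL", "FKBPL", "FKBPL",
                 "NOTCH4", "NOTCH4", "NOTCH4"),
  ensembl_id = c("ENSG00000224200", "ENSG00000204315", "ENSG00000223666",
                 "ENSG00000230907", "ENSG00000235396", "ENSG00000223355",
                 "ENSG00000204301"),
  stringsAsFactors = FALSE
)

# Number of distinct Ensembl identifiers per gene name in the excerpt.
ids_per_gene <- tapply(
  ensembl_ids_excerpt$ensembl_id,
  ensembl_ids_excerpt$gene_name,
  function(ids) length(unique(ids))
)

print(ids_per_gene)
```

Any gene with a count above one is a gene name that maps to several Ensembl identifiers, as is common in the MHC region.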

mzzclb commented 1 year ago

Hi @ramiromagno,

Thank you for your time.

What I mentioned is not related to different Ensembl IDs being assigned to the same gene.

A variant can have different gene clusters in genomic_context and ensembl_ids segments. Could you examine the code pasted below?

The genes of HCG23 and LOC105379657 are available in the ensembl_ids segment of the given variant although none of them is in the genomic_context segment.

library(gwasrapidd)

rs137931178 <- gwasrapidd::get_variants(variant_id = "rs137931178") # I have checked rs137931178 as an example

unique_genes_of_rs137931178_in_genomic_context <- unique(rs137931178@genomic_contexts$gene_name)
unique_genes_of_rs137931178_in_ensembl_ids <- unique(rs137931178@ensembl_ids$gene_name)

genes_of_genomic_context_of_rs137931178_not_in_ensembl_ids_rs137931178 <- setdiff(unique_genes_of_rs137931178_in_genomic_context, unique_genes_of_rs137931178_in_ensembl_ids)

print(genes_of_genomic_context_of_rs137931178_not_in_ensembl_ids_rs137931178)
# HCG23 and LOC105379657 are available in the ensembl_ids segment although none of them is in the genomic_context segment.

Why are some genes not included in the gene group in the ensembl_ids segment of the variant?

ramiromagno commented 1 year ago

Hi @mzzclb,

I think I understand your question now, although I also think you've written the opposite of what you meant at a certain point. But please tell me if I'm wrong.

So, in principle, you can have more gene names in the genomic_contexts table than in the ensembl_ids table, but not the other way around. That is the case in your example: you have HCG23 and LOC105379657 in genomic_contexts but not in ensembl_ids. The reverse does not happen, i.e. you don't have a gene name showing up in ensembl_ids that is missing from genomic_contexts.

When you wrote:

The genes of HCG23 and LOC105379657 are available in the ensembl_ids segment of the given variant although none of them is in the genomic_context segment.

I think you meant the other way around because HCG23 and LOC105379657 are available in the genomic_contexts table but not in ensembl_ids.

So why is it normal to have some gene names in the genomic_contexts table but not in the ensembl_ids table? Well, as I said earlier, the genomic_contexts table provides all Ensembl and RefSeq genes mapping within 50kb upstream and downstream of each GWAS Catalog variant. However, only Ensembl genes have associated Ensembl identifiers. So there are RefSeq genes that either go by another name in Ensembl or do not exist there at all, and therefore do not have an associated Ensembl identifier. The two genes you report are examples of each of these cases:

  1. The RefSeq gene HCG23 is known as TSBP1-AS1 in Ensembl. Note that TSBP1-AS1 is present both in genomic_contexts and in ensembl_ids.
  2. The RefSeq gene LOC105379657 carries the kind of name the NCBI uses when a published symbol is not available, i.e. orthologs have not yet been determined, and hence the gene is given a symbol constructed as 'LOC' + the GeneID. Again, this gene name only makes sense in the context of the NCBI system, not Ensembl's, so it does not have an associated Ensembl identifier.
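
The asymmetry described above can be checked by running setdiff() in both directions. This is just a small sketch: the two gene vectors below contain only the gene names mentioned in this thread for rs137931178 (a real query returns many more), and all object names are hypothetical. The LOC pattern check assumes the 'LOC' + GeneID naming convention described in point 2.

```r
# Gene names taken from this discussion for rs137931178. With live data, use
# unique(x@genomic_contexts$gene_name) and unique(x@ensembl_ids$gene_name),
# where x is the object returned by get_variants().
genes_in_genomic_contexts <- c("HCG23", "LOC105379657", "TSBP1-AS1")
genes_in_ensembl_ids      <- c("TSBP1-AS1")

# Gene names present in genomic_contexts but missing from ensembl_ids.
only_in_genomic_contexts <- setdiff(genes_in_genomic_contexts, genes_in_ensembl_ids)

# Gene names present in ensembl_ids but missing from genomic_contexts.
only_in_ensembl_ids <- setdiff(genes_in_ensembl_ids, genes_in_genomic_contexts)

# Flag NCBI placeholder symbols of the form 'LOC' followed by digits.
is_loc_symbol <- grepl("^LOC[0-9]+$", only_in_genomic_contexts)

print(only_in_genomic_contexts)
print(only_in_ensembl_ids)
print(is_loc_symbol)
```

An empty second result is what you would expect if every gene name in ensembl_ids also appears in genomic_contexts.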

I hope this helps.

mzzclb commented 1 year ago

Thank you very much @ramiromagno

ramiromagno commented 1 year ago

You're welcome!