halbritter-lab / kidney-genetics

Kidney-Genetics - database of kidney-related genes
MIT License
1 stars 0 forks source link

Consider to include Human Nephrogenesis Atlas expression values in manual curation of high-evidence genes #8

Open ninarank opened 1 year ago

ninarank commented 1 year ago

Instead of or in addition to using Descartes fetal gene expression, consider using "Human Nephrogenesis Atlas".

The "Human Nephrogenesis Atlas" provides ssRNA seq data of two 14-weeks human kidneys (after QC: 8,316 cells (6,667 nephrogenic)). They identified 18 cell clusters in the developing kidney. For each gene, you can get expression values of each cell cluster or of the total "Developmental Kidney". I suppose the "Developmental Kidney" expression value could be used if you only want 1 value per gene for the fetal kidney. This values is accessible via an API (script in progress).

So far, I did not find out, how exactly they calculate this expression value (which normalization technique etc.). I screened the original paper, but they didn't specify it there. I have not read the paper in detail, yet.

https://sckidney.flatironinstitute.org/ To get an example expression value: => Tab 'Gene Network' => select 'Developmental Kidney' => select gene of interest

ninarank commented 1 year ago
# This is a pre-version of a script using the Human Nephrogenesis Atlas - API 
# When querying one specific gene, the API queries some other genes of the gene functional network as well. Maybe this can be somehow avoided to make everything faster, but I didn't find out how. 
# TODO: Replace paths

library(httr)
library(jsonlite)
library(tidyverse)
library(progress)

# download HGNC gene table from github repository "kidney-genetics"
hgnc_gt_version <- "2023-06-21"
gene_table_url <- paste0("https://github.com/halbritter-lab/kidney-genetics/raw/main/analyses/A_AnnotationHGNC/results/non_alt_loci_set_coordinates.", hgnc_gt_version, ".csv.gz")
download.file(url = gene_table_url, destfile = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv.gz")) # TODO: Replace paths
gunzip(filename = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv.gz"), 
       destname = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv")) # TODO: Replace paths

# load HGNC gene table and filter for protein-coding genes 
HGNC_table <- read.csv("gene_score/raw/HGNC_2023-06-21.csv") %>% 
  filter(locus_group == "protein-coding gene") # TODO: Replace paths

# function for getting expression values via the Human nephrogenesis atlas API
get_nga_expression <- function(cell_cluster, entrez_id_vec){
  # set query url
  entrez_id_collapse <- paste(paste0("entrez=", entrez_id_vec), collapse = "&")
  url <- "https://sckidney.flatironinstitute.org/integrations/query/"
  query_url <- paste0(url, cell_cluster, "?", entrez_id_collapse)

  # set the header
  headers <- c("Content-Type" = "application/json")

  # get the response
  response <- GET(query_url, add_headers(headers))

  # get the response content
  content <- content(response, as="text", encoding = "UTF-8")
  content_list <- fromJSON(content)

  # select queried genes
  res <- data.frame(content_list$genes) %>% 
    filter(query == TRUE) %>% 
    dplyr::select(standard_name, systematic_name, entrez_id = entrez, expression_value) %>% 
    mutate(cell_cluster = cell_cluster)

  return(res)
}

# get expression values chunk wise (all genes at once not possible)
entrez_table <- HGNC_table %>% dplyr::select(entrez_id) %>% unique()
entrez_split_list <- split(entrez_table$entrez_id, ceiling(seq_along(entrez_table$entrez_id) / 250)) 
expr_vals_df <- data.frame()
cell_clusters <- c(57) # 57 = Developmental Kidney
pb <- progress_bar$new(total = length(entrez_split_list)*length(cell_clusters))

for (cell_cluster in cell_clusters){
  for (i in entrez_split_list){
    pb$tick()  # Increment progress bar
    cat("\n")
    sub_df <- get_nga_expression(cell_cluster, i)

    expr_vals_df <- dplyr::bind_rows(expr_vals_df, sub_df)
  }
}

View(expr_vals_df)

# write csv ...