Open ninarank opened 1 year ago
# This is a pre-version of a script using the Human Nephrogenesis Atlas - API
# When querying one specific gene, the API queries some other genes of the gene functional network as well. Maybe this can be somehow avoided to make everything faster, but I didn't find out how.
# TODO: Replace paths
library(httr)
library(jsonlite)
library(tidyverse)
library(progress)
# download HGNC gene table from github repository "kidney-genetics"
hgnc_gt_version <- "2023-06-21"
gene_table_url <- paste0("https://github.com/halbritter-lab/kidney-genetics/raw/main/analyses/A_AnnotationHGNC/results/non_alt_loci_set_coordinates.", hgnc_gt_version, ".csv.gz")
download.file(url = gene_table_url, destfile = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv.gz")) # TODO: Replace paths
gunzip(filename = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv.gz"),
destname = paste0("gene_score/raw/HGNC_", hgnc_gt_version, ".csv")) # TODO: Replace paths
# load HGNC gene table and filter for protein-coding genes
HGNC_table <- read.csv("gene_score/raw/HGNC_2023-06-21.csv") %>%
filter(locus_group == "protein-coding gene") # TODO: Replace paths
# function for getting expression values via the Human nephrogenesis atlas API
get_nga_expression <- function(cell_cluster, entrez_id_vec){
# set query url
entrez_id_collapse <- paste(paste0("entrez=", entrez_id_vec), collapse = "&")
url <- "https://sckidney.flatironinstitute.org/integrations/query/"
query_url <- paste0(url, cell_cluster, "?", entrez_id_collapse)
# set the header
headers <- c("Content-Type" = "application/json")
# get the response
response <- GET(query_url, add_headers(headers))
# get the response content
content <- content(response, as="text", encoding = "UTF-8")
content_list <- fromJSON(content)
# select queried genes
res <- data.frame(content_list$genes) %>%
filter(query == TRUE) %>%
dplyr::select(standard_name, systematic_name, entrez_id = entrez, expression_value) %>%
mutate(cell_cluster = cell_cluster)
return(res)
}
# get expression values chunk wise (all genes at once not possible)
entrez_table <- HGNC_table %>% dplyr::select(entrez_id) %>% unique()
entrez_split_list <- split(entrez_table$entrez_id, ceiling(seq_along(entrez_table$entrez_id) / 250))
expr_vals_df <- data.frame()
cell_clusters <- c(57) # 57 = Developmental Kidney
pb <- progress_bar$new(total = length(entrez_split_list)*length(cell_clusters))
for (cell_cluster in cell_clusters){
for (i in entrez_split_list){
pb$tick() # Increment progress bar
cat("\n")
sub_df <- get_nga_expression(cell_cluster, i)
expr_vals_df <- dplyr::bind_rows(expr_vals_df, sub_df)
}
}
View(expr_vals_df)
# write csv ...
Instead of or in addition to using Descartes fetal gene expression, consider using "Human Nephrogenesis Atlas".
The "Human Nephrogenesis Atlas" provides ssRNA seq data of two 14-weeks human kidneys (after QC: 8,316 cells (6,667 nephrogenic)). They identified 18 cell clusters in the developing kidney. For each gene, you can get expression values of each cell cluster or of the total "Developmental Kidney". I suppose the "Developmental Kidney" expression value could be used if you only want 1 value per gene for the fetal kidney. This values is accessible via an API (script in progress).
So far, I did not find out, how exactly they calculate this expression value (which normalization technique etc.). I screened the original paper, but they didn't specify it there. I have not read the paper in detail, yet.
https://sckidney.flatironinstitute.org/ To get an example expression value: => Tab 'Gene Network' => select 'Developmental Kidney' => select gene of interest