Open HuaZou opened 3 weeks ago
Dear @HuaZou , Thank you for your interest and success in using TOPOSCORE ! You're right, actually Met4 stands for Metaphlan4. The discrepancy that you noticed derived from errors due to platform analysis exchange (Ubuntu/Excel): after a careful revision the following files were added in replacement of the old DS2 (these samples fastq files have been already deposited in SRA database):
If you encounter other issues do not hesitate to write ! Best V.
Dear @HuaZou , Thank you for your interest and success in using TOPOSCORE ! You're right, actually Met4 stands for Metaphlan4. The discrepancy that you noticed derived from errors due to platform analysis exchange (Ubuntu/Excel): after a careful revision the following files were added in replacement of the old DS2 (these samples fastq files have been already deposited in SRA database):
- DS2_oncology_microbiome_data_DiscValid.csv
- DS2_oncology_microbiome_data_RCC.csv
If you encounter other issues do not hesitate to write ! Best V.
Dear Dr. Valerioiebba. Thank you so much for your kind reply.
Using the updated DS2 data, I have checked the differences (samples and species, also the TOPOSCORE) between the previous DS2 and updated DS2 data. I still had some questions:
_1. unmached samples and species: LUM3_976 in both DiscValid (NSCLS) and RCC dataset, which may be wrong._
## Old version (Abundance sum of samples not equal to 100%): DS2_oncology_microbiome_data-old.csv
## New version: 1.DS2_oncology_microbiome_data_DiscValid.csv; 2.DS2_oncology_microbiome_data_RCC.csv
library(tidyverse)
library(data.table)
# previous DS2 data
DS2_previous <- fread("DS2_oncology_microbiome_data-old.csv") # the previous DS2 data contained Discovery, Validation (NSCLC, RCC, UC)
# updated DS2 data
DiscValid <- fread("DS2_oncology_microbiome_data_DiscValid.csv") # updated DS2 data 2024/8/22
RCC <- fread("DS2_oncology_microbiome_data_RCC.csv") # updated DS2 data 2024/8/22
# integrated updated DS2 data
DS2_new <- RCC %>%
dplyr::rename(Taxa = Sample_id) %>%
dplyr::full_join(
DiscValid %>%
tibble::column_to_rownames("Sample_id") %>%
t() %>% as.data.frame() %>%
tibble::rownames_to_column("Taxa"),
by = "Taxa") %>%
tibble::column_to_rownames("Taxa") %>%
t() %>% as.data.frame() %>%
tibble::rownames_to_column("Sample_id")
# replaced missing values by 0
DS2_new[is.na(DS2_new)] <- 0
# write.csv(DS2_new, "DS2_oncology_microbiome_data.csv", row.names = F) # updated DS2 for following TOPOSCORE analysis
# differences (samples and species) between previous and updated DS2 Data
nrow(DS2_previous) - nrow(DS2_new) # 133 sampels
diff_samples <- setdiff(DS2_previous$Sample_id, DS2_new$Sample_id) # (135 samples) "LUM3_976", "LUM3_937", "2302005"-"2319041"
# setdiff(DS2_new$Sample_id, DS2_previous$Sample_id) "LUM3_976.x" "LUM3_976.y"
ncol(DS2_previous) - ncol(DS2_new) # 101 species
diff_species <- setdiff(colnames(DS2_previous), colnames(DS2_new)) # (191 species) "s__Clostridium_innocuum"
# NO. of DiscValid & RCC provided by paper (previous: 715; updated: 582)
## Discovery dataset: 245
## Validation dataset: 254
## RCC: 83
## UC: 133 (not provided in updated DS2 data)
# Total abundance per sample
rowSums(DS2_new[, -1]) %>% data.frame() %>% setNames("SUM_Abundance") -> met4_SumSample
met4_SumSample$Sample_id <- DS2_new$Sample_id
View(met4_SumSample)
... # the 1.1-1.4 steps
### 1.5 Toposcoring ----
scores_disc <- compute_toposcore(met4_disc, sigb1 = SIGB1, sigb2 = SIGB2)
pred_disc <- clin_disc %>%
dplyr::left_join(scores_disc, by = 'Sample_id') %>%
dplyr::filter(OS12 != '') %>%
dplyr::mutate(OS12 = factor(OS12, levels = c('NR', 'R')))
roc <- calc_roc(pred_disc$OS12, pred_disc$TOPOB01, verbose = TRUE)
log_msg('ROC AUC = %.2f [%.2f - %.2f]', roc$AUC[1], roc$AUC[2], roc$AUC[3])
youden <- roc$ROC_DF %>% mutate(SENS = TPR, SPEC = 1 - FPR) %>% mutate(J = SENS + SPEC - 1)
ggplot(youden, aes(x = THRESHOLD, y = J)) + geom_point()
ycut_nr <- youden[which.max(youden$J), ] # 0.5351351 (updated: 0.8027027)
ycut_r <- youden[which(youden$THRESHOLD > 0.7 & youden$J > 0.23), ] # 0.7911411 (updated: 0.7954955-0.8073574)
log_msg('Cut-off thresholds = %.4f and %.4f', ycut_nr$THRESHOLD, ycut_r$THRESHOLD)
plt_roc <- plot_roc(roc$ROC_DF) +
geom_point(data = ycut_nr, color = 'red') +
geom_point(data = ycut_r, color = 'green')
ggsave(plt_roc, filename = 'fig_roc.pdf', width = 10, height = 10, units = "cm")
plt_kde_disc <- plot_toposcoreb01_density(
scores_disc,
clin_disc,
lims = c(ycut_r$THRESHOLD, ycut_nr$THRESHOLD))
ggsave(plt_kde_disc, filename = 'fig_kde_disc.pdf', width = 10, height = 10, units = "cm")
I appreciate your assistance once again, and thank you in advance.
@HuaZou Could share the full code for steps # the 1.1-1.4 steps so i could help with the debugging and fixing of the code? Thanks
@HuaZou Could share the full code for steps # the 1.1-1.4 steps so i could help with the debugging and fixing of the code? Thanks
Thank you so much. Here were the codes
# load required libraries and helper functions
source('tp_helper.R')
## 1. Discovery analysis set ----
log_msg('####### Discovery analysis set #########')
clin_disc <- load_clin(cohort = 'Disc')
met4_disc <- load_microbiome(clin_disc)
### 1.1 CoxPH screen (except Akkermansia) ----
res_surv <- load_or_compute('res_surv2.rds',
screen_surv_met4(clin_disc, met4_disc, type = 'OS'))
plot_surv_forest(res_surv, alpha = 0.05)
# select species based on average HR:
res_surv_filt <- res_surv %>%
dplyr::filter(HR <= 0.8 | HR >= 1.25)
selected_species <- res_surv_filt$SPECIES
log_msg('%d/%d species selected', length(selected_species), nrow(res_surv))
plot_surv_forest(res_surv_filt %>% dplyr::arrange(HR))
met4_disc_filt <- met4_disc[, c('Sample_id', selected_species)]
hr_annots <- res_surv_filt %>%
dplyr::mutate(HRCAT = ifelse(HR < 1, 'R', 'NR')) %>%
dplyr::select(HRCAT, SPECIES)
### 1.2 Correlation screen ----
res <- load_or_compute('res_pairs2.rds', screen_pairs(met4_disc_filt))
# filter based on Fisher Bonferroni-corrected p-values <= 0.05
res_filt <- load_or_compute('res_pairs_filt2.rds', {
min_p <- bind_rows(list(
res %>% dplyr::select(VAR = VAR1, FISHER_P),
res %>% dplyr::select(VAR = VAR2, FISHER_P)
)) %>%
dplyr::group_by(VAR) %>%
dplyr::summarize(MIN_P = min(FISHER_P)) %>%
dplyr::arrange(-MIN_P)
sp2 <- min_p[min_p$MIN_P <= 0.05 / nrow(min_p), 'VAR', drop = TRUE]
res %>%
dplyr::filter(VAR1 %in% sp2 & VAR2 %in% sp2)
})
log_msg('Keeping %d pairs (%d species)', nrow(res_filt),
length(unique(c(res_filt$VAR1, res_filt$VAR2))))
### 1.3 Clustering ----
SCORE <- 'fisher_p'
METHOD <- 'ward.D2'
DISTANCE <- 'manhattan'
cc <- cluster_species(res_filt, score = SCORE, method = METHOD, distance = DISTANCE, k = 7) %>%
renumber_clusters()
plt_fisher_disc <- plot_score_matrix(
res_filt,
score = SCORE,
method = METHOD,
distance = DISTANCE,
annots = list(cc, hr_annots),
fontsize = 3)
ggsave(plt_fisher_disc, filename = 'fig_fisher_disc.pdf', width = 30, height = 20, units = "cm")
### 1.4 Definition of SIGB groups ----
cc_names <- unique(cc$CLUSTER)
cc_species <- setNames(lapply(cc_names, function(x) cc[cc$CLUSTER == x, 'SPECIES', drop = TRUE]), cc_names)
SIGB1 <- load_or_compute('sigb12.rds', c(cc_species$C5, cc_species$C6))
SIGB2 <- load_or_compute('sigb22.rds', c(cc_species$C1, cc_species$C2, cc_species$C3))
### 1.5 Toposcoring ----
scores_disc <- compute_toposcore(met4_disc, sigb1 = SIGB1, sigb2 = SIGB2)
pred_disc <- clin_disc %>%
dplyr::left_join(scores_disc, by = 'Sample_id') %>%
dplyr::filter(OS12 != '') %>%
dplyr::mutate(OS12 = factor(OS12, levels = c('NR', 'R')))
roc <- calc_roc(pred_disc$OS12, pred_disc$TOPOB01, verbose = TRUE)
log_msg('ROC AUC = %.2f [%.2f - %.2f]', roc$AUC[1], roc$AUC[2], roc$AUC[3])
youden <- roc$ROC_DF %>% mutate(SENS = TPR, SPEC = 1 - FPR) %>% mutate(J = SENS + SPEC - 1)
ggplot(youden, aes(x = THRESHOLD, y = J)) + geom_point()
ycut_nr <- youden[which.max(youden$J), ] # 0.5351351
ycut_r <- youden[which(youden$THRESHOLD > 0.7 & youden$J > 0.23), ] # 0.7911411
log_msg('Cut-off thresholds = %.4f and %.4f', ycut_nr$THRESHOLD, ycut_r$THRESHOLD)
plt_roc <- plot_roc(roc$ROC_DF) +
geom_point(data = ycut_nr, color = 'red') +
geom_point(data = ycut_r, color = 'green')
ggsave(plt_roc, filename = 'fig_roc.pdf', width = 10, height = 10, units = "cm")
plt_kde_disc <- plot_toposcoreb01_density(
scores_disc,
clin_disc,
lims = c(ycut_r$THRESHOLD, ycut_nr$THRESHOLD))
ggsave(plt_kde_disc, filename = 'fig_kde_disc.pdf', width = 10, height = 10, units = "cm")
Thank you in advance for your help.
Hello, I have been working with the TOPOSCORE tool and have successfully executed the associated code. However, I have encountered a query regarding the 'met4' object. I have utilized the following code snippets to verify the origin of the 'met4' object, presuming it to be directly derived from the 'metaphlan4'. During this process, I noticed that the total relative abundance for each sample does not sum up to 100%. This discrepancy has raised some concerns regarding the preprocessing steps that might have been applied to the data matrix. Could you kindly provide some guidance or clarification on this matter?