The matrix of object "met4" whether was directly from "metaphlan4"?

HuaZou commented 3 weeks ago

Hello, I have been working with the TOPOSCORE tool and have successfully executed the associated code. However, I have encountered a query regarding the 'met4' object. I have utilized the following code snippets to verify the origin of the 'met4' object, presuming it to be directly derived from the 'metaphlan4'. During this process, I noticed that the total relative abundance for each sample does not sum up to 100%. This discrepancy has raised some concerns regarding the preprocessing steps that might have been applied to the data matrix. Could you kindly provide some guidance or clarification on this matter?

... # previous codes were in tp_main.R

rowSums(met4[, -1]) %>% data.frame() %>% setNames("SUM_Abundance") -> met4_SumSample
met4_SumSample$Sample_id <- met4$Sample_id
View(met4_SumSample)

valerioiebba commented 3 weeks ago

Dear @HuaZou , Thank you for your interest and success in using TOPOSCORE ! You're right, actually Met4 stands for Metaphlan4. The discrepancy that you noticed derived from errors due to platform analysis exchange (Ubuntu/Excel): after a careful revision the following files were added in replacement of the old DS2 (these samples fastq files have been already deposited in SRA database):

DS2_oncology_microbiome_data_DiscValid.csv
DS2_oncology_microbiome_data_RCC.csv

If you encounter other issues do not hesitate to write ! Best V.

HuaZou commented 3 weeks ago

Dear @HuaZou , Thank you for your interest and success in using TOPOSCORE ! You're right, actually Met4 stands for Metaphlan4. The discrepancy that you noticed derived from errors due to platform analysis exchange (Ubuntu/Excel): after a careful revision the following files were added in replacement of the old DS2 (these samples fastq files have been already deposited in SRA database):

DS2_oncology_microbiome_data_DiscValid.csv

DS2_oncology_microbiome_data_RCC.csv

If you encounter other issues do not hesitate to write ! Best V.

Dear Dr. Valerioiebba. Thank you so much for your kind reply.

Using the updated DS2 data, I have checked the differences (samples and species, also the TOPOSCORE) between the previous DS2 and updated DS2 data. I still had some questions:

_1. unmached samples and species: LUM3_976 in both DiscValid (NSCLS) and RCC dataset, which may be wrong._

## Old version (Abundance sum of samples not equal to 100%): DS2_oncology_microbiome_data-old.csv
## New version: 1.DS2_oncology_microbiome_data_DiscValid.csv; 2.DS2_oncology_microbiome_data_RCC.csv

library(tidyverse)
library(data.table)

# previous DS2 data
DS2_previous <- fread("DS2_oncology_microbiome_data-old.csv") # the previous DS2 data contained Discovery, Validation (NSCLC, RCC, UC) 

# updated DS2 data
DiscValid <- fread("DS2_oncology_microbiome_data_DiscValid.csv") # updated DS2 data 2024/8/22
RCC <- fread("DS2_oncology_microbiome_data_RCC.csv") # updated DS2 data 2024/8/22

# integrated updated DS2 data
DS2_new <- RCC %>%
  dplyr::rename(Taxa = Sample_id) %>%
  dplyr::full_join(
    DiscValid %>%
      tibble::column_to_rownames("Sample_id") %>%
      t() %>% as.data.frame() %>%
      tibble::rownames_to_column("Taxa"),
    by = "Taxa") %>%
  tibble::column_to_rownames("Taxa") %>%
  t() %>% as.data.frame() %>%
  tibble::rownames_to_column("Sample_id")  

# replaced missing values by 0
DS2_new[is.na(DS2_new)] <- 0
# write.csv(DS2_new, "DS2_oncology_microbiome_data.csv", row.names = F) # updated DS2 for following TOPOSCORE analysis

# differences (samples and species) between previous and updated DS2 Data
nrow(DS2_previous) - nrow(DS2_new) # 133 sampels
diff_samples <- setdiff(DS2_previous$Sample_id, DS2_new$Sample_id) # (135 samples) "LUM3_976", "LUM3_937", "2302005"-"2319041"
# setdiff(DS2_new$Sample_id, DS2_previous$Sample_id) "LUM3_976.x" "LUM3_976.y"

ncol(DS2_previous) - ncol(DS2_new) # 101 species
diff_species <- setdiff(colnames(DS2_previous), colnames(DS2_new)) # (191 species) "s__Clostridium_innocuum"

# NO. of DiscValid & RCC provided by paper (previous: 715; updated: 582)
## Discovery dataset: 245
## Validation dataset: 254
## RCC: 83 
## UC: 133 (not provided in updated DS2 data)

# Total abundance per sample
rowSums(DS2_new[, -1]) %>% data.frame() %>% setNames("SUM_Abundance") -> met4_SumSample
met4_SumSample$Sample_id <- DS2_new$Sample_id

View(met4_SumSample)

The cutoff values for TOPOSCORE, as determined by the updated DS2 data, were not 0.5351351 and 0.7911411, even though the distribution of TOPOSCORE was consistent with the previous analysis, using the same pipeline.

... # the 1.1-1.4 steps

### 1.5 Toposcoring ----
scores_disc <- compute_toposcore(met4_disc, sigb1 = SIGB1, sigb2 = SIGB2)
pred_disc <- clin_disc %>% 
  dplyr::left_join(scores_disc, by = 'Sample_id') %>% 
  dplyr::filter(OS12 != '') %>% 
  dplyr::mutate(OS12 = factor(OS12, levels = c('NR', 'R')))
roc <- calc_roc(pred_disc$OS12, pred_disc$TOPOB01, verbose = TRUE)
log_msg('ROC AUC = %.2f [%.2f - %.2f]', roc$AUC[1], roc$AUC[2], roc$AUC[3])
youden <- roc$ROC_DF %>% mutate(SENS = TPR, SPEC = 1 - FPR) %>% mutate(J = SENS + SPEC - 1)
ggplot(youden, aes(x = THRESHOLD, y = J)) + geom_point()
ycut_nr <- youden[which.max(youden$J), ] # 0.5351351 (updated: 0.8027027)
ycut_r <- youden[which(youden$THRESHOLD > 0.7 & youden$J > 0.23), ] # 0.7911411 (updated: 0.7954955-0.8073574)
log_msg('Cut-off thresholds = %.4f and %.4f', ycut_nr$THRESHOLD, ycut_r$THRESHOLD)
plt_roc <- plot_roc(roc$ROC_DF) + 
  geom_point(data = ycut_nr, color = 'red') +
  geom_point(data = ycut_r, color = 'green')
ggsave(plt_roc, filename = 'fig_roc.pdf', width = 10, height = 10, units = "cm")
plt_kde_disc <- plot_toposcoreb01_density(
  scores_disc, 
  clin_disc, 
  lims = c(ycut_r$THRESHOLD, ycut_nr$THRESHOLD))
ggsave(plt_kde_disc, filename = 'fig_kde_disc.pdf', width = 10, height = 10, units = "cm")

I appreciate your assistance once again, and thank you in advance.

adhal007 commented 1 week ago

@HuaZou Could share the full code for steps # the 1.1-1.4 steps so i could help with the debugging and fixing of the code? Thanks

HuaZou commented 1 week ago

@HuaZou Could share the full code for steps # the 1.1-1.4 steps so i could help with the debugging and fixing of the code? Thanks

Thank you so much. Here were the codes

# load required libraries and helper functions
source('tp_helper.R')

## 1. Discovery analysis set ----
log_msg('####### Discovery analysis set #########')
clin_disc <- load_clin(cohort = 'Disc')
met4_disc <- load_microbiome(clin_disc)

### 1.1 CoxPH screen (except Akkermansia) ----
res_surv <- load_or_compute('res_surv2.rds',
                          screen_surv_met4(clin_disc, met4_disc, type = 'OS'))
plot_surv_forest(res_surv, alpha = 0.05)
# select species based on average HR:
res_surv_filt <- res_surv %>% 
  dplyr::filter(HR <= 0.8 | HR >= 1.25)
selected_species <- res_surv_filt$SPECIES
log_msg('%d/%d species selected', length(selected_species), nrow(res_surv))
plot_surv_forest(res_surv_filt %>% dplyr::arrange(HR))
met4_disc_filt <- met4_disc[, c('Sample_id', selected_species)]
hr_annots <- res_surv_filt %>% 
  dplyr::mutate(HRCAT = ifelse(HR < 1, 'R', 'NR')) %>% 
  dplyr::select(HRCAT, SPECIES)

### 1.2 Correlation screen ----
res <- load_or_compute('res_pairs2.rds', screen_pairs(met4_disc_filt))
# filter based on Fisher Bonferroni-corrected p-values <= 0.05
res_filt <- load_or_compute('res_pairs_filt2.rds', {
  min_p <- bind_rows(list(
    res %>% dplyr::select(VAR = VAR1, FISHER_P),
    res %>% dplyr::select(VAR = VAR2, FISHER_P)
  )) %>% 
    dplyr::group_by(VAR) %>% 
    dplyr::summarize(MIN_P = min(FISHER_P)) %>% 
    dplyr::arrange(-MIN_P)
  sp2 <- min_p[min_p$MIN_P <= 0.05 / nrow(min_p), 'VAR', drop = TRUE]
  res %>% 
    dplyr::filter(VAR1 %in% sp2 & VAR2 %in% sp2)
})
log_msg('Keeping %d pairs (%d species)', nrow(res_filt), 
        length(unique(c(res_filt$VAR1, res_filt$VAR2))))

### 1.3 Clustering ----
SCORE <- 'fisher_p'
METHOD <- 'ward.D2'
DISTANCE <- 'manhattan'
cc <- cluster_species(res_filt, score = SCORE, method = METHOD, distance = DISTANCE, k = 7) %>% 
  renumber_clusters()
plt_fisher_disc <- plot_score_matrix(
  res_filt, 
  score = SCORE, 
  method = METHOD, 
  distance = DISTANCE, 
  annots = list(cc, hr_annots), 
  fontsize = 3)

ggsave(plt_fisher_disc, filename = 'fig_fisher_disc.pdf', width = 30, height = 20, units = "cm")

### 1.4 Definition of SIGB groups ----
cc_names <- unique(cc$CLUSTER)
cc_species <- setNames(lapply(cc_names, function(x) cc[cc$CLUSTER == x, 'SPECIES', drop = TRUE]), cc_names)
SIGB1 <- load_or_compute('sigb12.rds', c(cc_species$C5, cc_species$C6))
SIGB2 <- load_or_compute('sigb22.rds', c(cc_species$C1, cc_species$C2, cc_species$C3))

### 1.5 Toposcoring ----
scores_disc <- compute_toposcore(met4_disc, sigb1 = SIGB1, sigb2 = SIGB2)
pred_disc <- clin_disc %>% 
  dplyr::left_join(scores_disc, by = 'Sample_id') %>% 
  dplyr::filter(OS12 != '') %>% 
  dplyr::mutate(OS12 = factor(OS12, levels = c('NR', 'R')))
roc <- calc_roc(pred_disc$OS12, pred_disc$TOPOB01, verbose = TRUE)
log_msg('ROC AUC = %.2f [%.2f - %.2f]', roc$AUC[1], roc$AUC[2], roc$AUC[3])
youden <- roc$ROC_DF %>% mutate(SENS = TPR, SPEC = 1 - FPR) %>% mutate(J = SENS + SPEC - 1)
ggplot(youden, aes(x = THRESHOLD, y = J)) + geom_point()
ycut_nr <- youden[which.max(youden$J), ] # 0.5351351
ycut_r <- youden[which(youden$THRESHOLD > 0.7 & youden$J > 0.23), ] # 0.7911411
log_msg('Cut-off thresholds = %.4f and %.4f', ycut_nr$THRESHOLD, ycut_r$THRESHOLD)
plt_roc <- plot_roc(roc$ROC_DF) + 
  geom_point(data = ycut_nr, color = 'red') +
  geom_point(data = ycut_r, color = 'green')
ggsave(plt_roc, filename = 'fig_roc.pdf', width = 10, height = 10, units = "cm")
plt_kde_disc <- plot_toposcoreb01_density(
  scores_disc, 
  clin_disc, 
  lims = c(ycut_r$THRESHOLD, ycut_nr$THRESHOLD))
ggsave(plt_kde_disc, filename = 'fig_kde_disc.pdf', width = 10, height = 10, units = "cm")

Thank you in advance for your help.

valerioiebba / TOPOSCORE

The matrix of object "met4" whether was directly from "metaphlan4"? #1