Error when running get_DE_info()

saeyslab / multinichenetr

MultiNicheNet: a flexible framework for differential cell-cell communication analysis from multi-sample multi-condition single-cell transcriptomics data

GNU General Public License v3.0

112 stars 14 forks source link

Error when running get_DE_info() #7

Closed VivianQM closed 1 year ago

VivianQM commented 1 year ago

Hi,

Thank you so much for this amazing update. I am trying to use it for a 3-way comparison and have been following the tutorial: https://github.com/saeyslab/multinichenetr/blob/main/vignettes/basic_analysis_steps_MISC.md

When performing the DE analysis for each cell type, however, I ran into this

Warning message: “Unknown or uninitialised column: cluster_id.” [1] "excluded cell types are:" [1] "Granulosa" "Theca"
[1] "These celltypes are not considered in the analysis. After removing samples that contain less cells than the required minimal, some groups don't have 2 or more samples anymore. As a result the analysis cannot be run. To solve this: decrease the number of min_cells or change your group_id and pool all samples that belong to groups that are not of interest! " [1] "None of the cell types passed the check. This might be due to 2 reasons. 1) no cell type has enough cells in >=2 samples per group. 2) problem in batch definition: not all levels of your batch are in each group - Also for groups not included in your contrasts!" Error in dplyr::group_by(): ! Must group by variables found in .data. Column contrast is not found. Column cluster_id is not found. Traceback:

get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, . celltype_id = celltype_id, covariates = covariates, contrasts_oi = contrasts_oi, . min_cells = 10)
celltype_de$de_output_tidy %>% dplyr::inner_join(celltype_de$de_output_tidy %>% . dplyr::group_by(contrast, cluster_id) %>% dplyr::count(), . by = c("cluster_id", "contrast")) %>% dplyr::mutate(cluster_id = paste0(cluster_id, . "\nnr of genes: ", n)) %>% dplyr::mutate(p-value <= 0.05 = p_val <= . 0.05) %>% ggplot(aes(x = p_val, fill = p-value <= 0.05))
ggplot(., aes(x = p_val, fill = p-value <= 0.05))
dplyr::mutate(., p-value <= 0.05 = p_val <= 0.05)
dplyr::mutate(., cluster_id = paste0(cluster_id, "\nnr of genes: ", . n))
dplyr::inner_join(., celltype_de$de_output_tidy %>% dplyr::group_by(contrast, . cluster_id) %>% dplyr::count(), by = c("cluster_id", "contrast"))
inner_join.data.frame(., celltype_de$de_output_tidy %>% dplyr::group_by(contrast, . cluster_id) %>% dplyr::count(), by = c("cluster_id", "contrast"))
auto_copy(x, y, copy = copy)
same_src(x, y)
same_src.data.frame(x, y)
is.data.frame(y)
celltype_de$de_output_tidy %>% dplyr::group_by(contrast, cluster_id) %>% . dplyr::count()
dplyr::count(.)
dplyr::group_by(., contrast, cluster_id)
group_by.data.frame(., contrast, cluster_id)
group_by_prepare(.data, ..., .add = .add, error_call = current_env())
abort(bullets, call = error_call)
signal_abort(cnd, .file)

I am very confident that I have enough cells in every group and cell type. Any suggestion on what might solve the problem? Any help is highly appreciated! Thank you so much.

joe-jhou2 commented 1 year ago

I have the same issue. get_DE_info rejected all my cell types. Some cell types have symbols (+ or /), or space. But even cell type named "CD4" was also excluded. Was that only bcz not enough cells for a given cell type in each group?

SergioRodLla commented 1 year ago

Hi, I am getting the same error following the same steps. As you, I am confident that the groups and cell-types I selected pass the threshold of the minimum amount of cells required. I also ensured that all the variables are syntactically valid (make.names).

browaeysrobin commented 1 year ago

Thank you all for raising this issue. We will try to fix it as soon as possible (hopefully next week).

browaeysrobin commented 1 year ago

Hi @VivianQM @mimisikai @SergioRodLla

I could replicate this issue by putting extremely high thresholds for genes to be expressed - as done in this example:

get_DE_info(sce, sample_id, group_id, celltype_id, batches, covariates, contrasts_oi, min_cells = 10, assay_oi_pb = "counts", fun_oi_pb = "sum", de_method_oi = "edgeR", findMarkers = FALSE, contrast_tbl = NULL, filterByExpr.min.count = 100000, filterByExpr.min.total.count = 100000, filterByExpr.large.n = 5, filterByExpr.min.prop = 0.7)

This might indicate that cell types will be removed if no genes pass the default gene filtering. Could you try to put filterByExpr.min.count = 0 and filterByExpr.min.total.count = 0 as an experiment to see whether the issue concerns gene filtering. Since these parameter settings will not perform gene filtering, the analysis should run if the issue is due to gene filtering.

If the issue seems to concern gene filtering: read the documentation of filterByExpr from edgeR to understand these filtering criteria, and think about whether this makes sense for your data. Could you then provide me with some info of this? If none of your cell types would pass this filtering criterion I would be very worried.

If the issue remains: could you share the following plot: abundance_expression_info$abund_plot_sample ?

SergioRodLla commented 1 year ago

Hi @browaeysrobin

Thank you for the fast reply. I tried to use filterByExpr.min.count = 0 and filterByExpr.min.total.count = 0 and it is giving me the same error. I forgot to mention that for my case, not all the groups in my grouping variable (group_id) reach the condition of >=2 samples with >=10 cells. However, the 3 contrasts I specified in contrasts_oi reach this requirement. Does this mean that I should remove those groups from my sce object even though they are not indicated as contrasts of interest? If I do this get_DE_info() runs without errors.

browaeysrobin commented 1 year ago

Hi @SergioRodLla

Does this mean that I should remove those groups from my sce object even though they are not indicated as contrasts of interest? If I do this get_DE_info() runs without errors.

Indeed. Cell types should have enough cells in >=2 samples in each group in your data. So even groups that are not directly compared in your contrast are considered during the DE model analysis. (that's also mentioned in the error message normally ... - Also for groups not included in your contrasts!.

To solve this, you can merge groups that are not of interest for your contrasts, or subset your data and leaving out these groups.

SergioRodLla commented 1 year ago

Thank you!

I wanted to make sure this was OK since at Step 3 of the introductory vignette (https://github.com/saeyslab/multinichenetr/blob/main/vignettes/basic_analysis_steps_MISC.md) I am getting a ligand_activities_targets_DEgenes$ligand_activities dataframe with rows with NA values in the "target" and "ligand_target_weighted" columns. Specially if I use adjusted p-value instead of normal p-value, which happens in nearly half of the rows. Is this because of gene names missing from the ligand_target_matrix? I am using the mouse Nichenet v2 networks.

Thank you again.

VivianQM commented 1 year ago

Hi @browaeysrobin,

Thanks for your reply. The issue remains in my case after trying to put filterByExpr.min.count = 0 and filterByExpr.min.total.count = 0. Here is my plot for abundance_expression_info$abund_plot_sample:

I also made sure that cell types have enough cells in >=2 samples in each group in sce (I subsetted the dataset)

Any clue why this is still happening? Any help is appreciated. Thank you!

catsargent commented 1 year ago

Hi,

I am encountering the same error:

> DE_info = get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, celltype_id = celltype_id, batches = batches, covariates = covariates, contrasts_oi = contrasts_oi, min_cells = min_cells)

[1] "excluded cell types are:"
 [1] "MAC"                     "SMACs"                   "VSM"                     "Tc"                      "Ductal.KCs"              "Classical.MECs"         
 [7] "SerACs"                  "Oral.Mucosal.KCs"        "ILCs"                    "Fibroblast.1"            "Fibroblast.2"            "Non.classical.MECs"     
[13] "Non.Classical.Monocytes" "Classical.Monocytes"     "Macrophage"              "cDC1"                    "Th"                      "Treg"                   
[19] "MAIT"                    "IgM.Plasma.Cells"        "IgG.Plasma.Cells"        "CD8..IEL"                "IgA.Plasma.Cells"        "Plasmablasts"           
[25] "CD16..NK.Cells"          "VECs"                    "Transitional.Ductal.KCs" "Fibroblast.3"            "Pericytes"               "LECs"                   
[1] "These celltypes are not considered in the analysis. After removing samples that contain less cells than the required minimal, some groups don't have 2 or more samples anymore. As a result the analysis cannot be run. To solve this: decrease the number of min_cells or change your group_id and pool all samples that belong to groups that are not of interest! "
[1] "None of the cell types passed the check. This might be due to 2 reasons. 1) no cell type has enough cells in >=2 samples per group. 2) problem in batch definition: not all levels of your batch are in each group - Also for groups not included in your contrasts!"
Error in `dplyr::group_by()`:
! Must group by variables found in `.data`.
Column `contrast` is not found.
Column `cluster_id` is not found.
Run `rlang::last_trace()` to see where the error occurred.
Warning message:
Unknown or uninitialised column: `cluster_id`.

I expected a number of cell types to be excluded but not all. Here is the plot for abundance_expression_info$abund_plot_sample: cell_type_abundance_per_sample_masked

I then subset the sce object to include only the cell types that passed the criteria of >= 2 samples per group with 10 cells or more and I get the same error. I also tried using filterByExpr.min.count = 0 and filterByExpr.min.total.count = 0 as suggested and nothing changed. I did not set any batches.

Any help would be much appreciated. Thanks!

browaeysrobin commented 1 year ago

Hi @catsargent, Hi @VivianQM

To help me further diagnose the problem:

1) Can you run get_DE_info with the argument: findMarkers = TRUE instead the default findMarkers = FALSE. Do you then get the same error or not?

2) Can you merge your groups/samples so you have only groups with 3 or more samples per group (just for a test, not for your analysis). Do you then get the same error or not?

3) Can you show how you defined the contrasts, batches and covariates?

catsargent commented 1 year ago

Thanks for the quick reply. I tried both of those suggestions and unfortunately they both resulted in the same original error.

browaeysrobin commented 1 year ago

@catsargent

Thanks for the quick reply. I tried both of those suggestions and unfortunately they both resulted in the same original error.

and can you show how you defined the contrasts, batches and covariates?

catsargent commented 1 year ago

Sure! Here you go..

sample_id = "Sample"
group_id = "Niche"
celltype_id = "cell_type"
covariates = NA
batches = NA

contrasts_oi = c('Parotid.Gland-(Minor.Salivary.Gland+Submandibular.Gland)/2','Minor.Salivary.Gland-(Parotid.Gland+Submandibular.Gland)/2','Submandibular.Gland-(Minor.Salivary.Gland+Parotid.Gland)/2')

contrast_tbl = tibble(contrast = 
                        c("Parotid.Gland-(Minor.Salivary.Gland+Submandibular.Gland)/2","Minor.Salivary.Gland-(Parotid.Gland+Submandibular.Gland)/2", "Submandibular.Gland-(Minor.Salivary.Gland+Parotid.Gland)/2"), 
                      group = c("Parotid.Gland","Minor.Salivary.Gland","Submandibular.Gland"))

browaeysrobin commented 1 year ago

@catsargent

Everything looks fine at first sight. I have no idea what is going wrong and it's very hard when I cannot reproduce your errors myself.

To better see where exactly things might go wrong, can you run the following snippet of code in the place where you would normally run get_DE_info()?

celltypes = SummarizedExperiment::colData(sce)[,celltype_id] %>% unique()
contrasts = contrasts_oi

perform_muscat_de_analysis_test = function(sce, sample_id, celltype_id, group_id, batches, covariates, contrasts, assay_oi_pb = "counts", fun_oi_pb = "sum", de_method_oi = "edgeR", min_cells = 10, filterByExpr.min.count = 7, filterByExpr.min.total.count = 15, filterByExpr.large.n = 4, filterByExpr.min.prop = 0.7){
  # conditions of interest in the contrast should be present in the in the group column of the metadata
  groups_oi = SummarizedExperiment::colData(sce)[,group_id] %>% unique()
  conditions_oi = stringr::str_split(contrasts, "'") %>% unlist() %>% unique() %>%
    # stringr::str_split("[:digit:]") %>% unlist() %>% unique() %>%
    stringr::str_split("\\)") %>% unlist() %>% unique() %>%
    stringr::str_split("\\(") %>% unlist() %>% unique() %>%
    stringr::str_split("-") %>% unlist() %>% unique() %>%
    stringr::str_split("\\+") %>% unlist() %>% unique() %>%
    stringr::str_split("\\*") %>% unlist() %>% unique() %>%
    stringr::str_split("\\/") %>% unlist() %>% unique() %>% generics::setdiff(c("",","," ,", ", ")) %>% unlist() %>% unique()
  conditions_oi = conditions_oi[is.na(suppressWarnings(as.numeric(conditions_oi)))]

  # conditions of interest in the contrast should be present in the in the contrast_tbl
  contrasts_simplified = stringr::str_split(contrasts, "'") %>% unlist() %>% unique() %>%
    stringr::str_split(",") %>% unlist() %>% unique() %>% generics::setdiff(c("",",")) %>% unlist() %>% unique()

  # prepare sce for the muscat pseudobulk analysis
  sce$id = sce[[sample_id]]
  sce = muscat::prepSCE(sce,
                              kid = celltype_id, # subpopulation assignments
                              gid = group_id,  # group IDs (ctrl/stim)
                              sid = "id",   # sample IDs (ctrl/stim.1234)
                              drop = FALSE)  # drop all other SummarizedExperiment::colData columns ----------------- change to false
  print("sce after prepSCE")
  print(sce)

  pb = muscat::aggregateData(sce,
                             assay = assay_oi_pb, fun = fun_oi_pb,
                             by = c("cluster_id", "sample_id"))
  print("pb after aggregateData")
  print(pb)

  # prepare the experiment info (ei) table if batches present
  if(length(batches) > 1){
    batches_present = TRUE
  } else {
    if(!is.na(batches)){
      batches_present = TRUE
    } else {
      batches_present = FALSE

    }
  }

  if(batches_present){
    extra_metadata = SummarizedExperiment::colData(sce)  %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id), all_of(batches)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor)
  } else {
    extra_metadata = SummarizedExperiment::colData(sce) %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor) 
  } 
  if('sample_id' != sample_id){
    extra_metadata$sample_id = extra_metadata[[sample_id]]
  }
  ei = S4Vectors::metadata(sce)$experiment_info

  ei = ei %>%  dplyr::inner_join(extra_metadata, by = "sample_id")

  # prepare the experiment info (ei) table if covariates present
  if(length(covariates) > 1){
    covariates_present = TRUE
  } else {
    if(!is.na(covariates)){
      covariates_present = TRUE
    } else {
      covariates_present = FALSE

    }
  }

  if(covariates_present){
    extra_metadata = SummarizedExperiment::colData(sce)  %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id), all_of(covariates)) %>% dplyr::distinct() ## no factorization of the continuous covariates!
  } else {
    extra_metadata = SummarizedExperiment::colData(sce) %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor) 
  } 
  if('sample_id' != sample_id){
    extra_metadata$sample_id = extra_metadata[[sample_id]]
  }
  if(is.null(ei)){
    ei = S4Vectors::metadata(sce)$experiment_info
  }

  ei = ei %>%  dplyr::inner_join(extra_metadata, by = "sample_id")

  # prepare the design and contrast matrix for the muscat DE analysis

  if(batches_present == TRUE & covariates_present == FALSE){

    batches_string = paste0("ei$",batches) %>% paste(collapse = " + ")
    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", batches_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == FALSE & covariates_present == TRUE) {

    covariates_string = paste0("ei$",covariates) %>% paste(collapse = " + ")
    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", covariates_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == TRUE & covariates_present == TRUE) {

    batches_string = paste0("ei$",batches) %>% paste(collapse = " + ")
    covariates_string = paste0("ei$",covariates) %>% paste(collapse = " + ")

    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", batches_string, " + ", covariates_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == FALSE & covariates_present == FALSE) {

    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id)))

  }

  contrast = eval(parse(text=paste("limma::makeContrasts(", contrasts, ",levels=design)",sep="")))

  # filter genes
  celltype = SummarizedExperiment::assayNames(pb) %>% .[1]
  pb = pb[edgeR::filterByExpr(y = SummarizedExperiment::assay(pb, celltype), design = design, min.count = filterByExpr.min.count, min.total.count = filterByExpr.min.total.count, large.n = filterByExpr.large.n, min.prop = filterByExpr.min.prop), ]
  print("pb after gene filtering")
  print(pb)

  # run DS analysis
  res = muscat::pbDS(pb, method = de_method_oi , design = design, contrast = contrast, min_cells = min_cells, verbose = FALSE, filter = "none")
  print("res")
  print(head(res[[1]][[1]][[1]]))

  de_output_tidy = muscat::resDS(sce, res, bind = "row", cpm = FALSE, frq = FALSE) %>% tibble::as_tibble() %>% dplyr::rename(p_adj = p_adj.glb)
  print("de_output_tidy")
  print(head(de_output_tidy))

  output = list(de_output = res, de_output_tidy = de_output_tidy)
  return(output)
}
DE_list = celltypes %>% lapply(function(celltype_oi, sce){
  sce_oi = sce[, SummarizedExperiment::colData(sce)[,celltype_id] == celltype_oi]
  DE_result = perform_muscat_de_analysis_test(sce = sce_oi, 
                                     sample_id = sample_id, 
                                     celltype_id = celltype_id, 
                                     group_id = group_id, 
                                     batches = batches, 
                                     covariates = covariates, 
                                     contrasts = contrasts_oi)
}, sce)

Can you send me the output and error/warnings? (You can do that via e-mail if you would prefer that)

VivianQM commented 1 year ago

Hi @browaeysrobin ,

Tried all of the suggestions you gave to @catsargent . None of them made a difference. This is how I defined my contrast and variables.

And I ran the block of code that you gave. No errors or warnings were given. I printed: "sce after prepSCE" "pb after aggregateData" - describes a singlecellexperiment object, assays "Granulosa" "pb after gene filtering" - describes a singlecellexperiment object, assays "Granulosa" "res" - a table of DEA results "de_output_tidy" - a table of DEA results (with negative folds marked red) And all the above again but with the assay being "Theca" (Theca and Granulosa are my two cell types).

Thanks for the help

browaeysrobin commented 1 year ago

Hi @VivianQM Since you did not encounter any error now: can you run this as well

  celltype_de = list(
    de_output = c(DE_list %>% purrr::map("de_output")),
    de_output_tidy = DE_list %>% purrr::map("de_output_tidy") %>% bind_rows()
    )
  excluded_celltypes = celltypes %>% generics::setdiff(celltype_de$de_output_tidy$cluster_id) %>% unique()
  if (length(excluded_celltypes) > 0) {
    print("excluded cell types are:")
    print(excluded_celltypes)
    print("These celltypes are not considered in the analysis. After removing samples that contain less cells than the required minimal, some groups don't have 2 or more samples anymore. As a result the analysis cannot be run. To solve this: decrease the number of min_cells or change your group_id and pool all samples that belong to groups that are not of interest! ")
  }
  if (length(excluded_celltypes) == length(celltypes)) {
    print("None of the cell types passed the check. This might be due to 2 reasons. 1) no cell type has enough cells in >=2 samples per group. 2) problem in batch definition: not all levels of your batch are in each group - Also for groups not included in your contrasts!")
  }

  hist_pvals = celltype_de$de_output_tidy %>% dplyr::inner_join(celltype_de$de_output_tidy %>% dplyr::group_by(contrast,cluster_id) %>% dplyr::count(), by = c("cluster_id","contrast")) %>% 
    dplyr::mutate(cluster_id = paste0(cluster_id, "\nnr of genes: ", n)) %>% dplyr::mutate(`p-value <= 0.05` = p_val <= 0.05) %>% 
    ggplot(aes(x = p_val, fill = `p-value <= 0.05`)) + 
    geom_histogram(binwidth = 0.05,boundary=0, color = "grey35") + scale_fill_manual(values = c("grey90", "lightsteelblue1")) + 
    facet_grid(contrast~cluster_id) + ggtitle("P-value histograms") + theme_bw()

Can you send me the exact output that you get on your console for the previous and this part of code? (can be done via mail if you don't want to share it here). I need to see the exact output in order to be able to pinpoint the problem.

VivianQM commented 1 year ago

Hi @VivianQM Since you did not encounter any error now: can you run this as well

  celltype_de = list(
    de_output = c(DE_list %>% purrr::map("de_output")),
    de_output_tidy = DE_list %>% purrr::map("de_output_tidy") %>% bind_rows()
    )
  excluded_celltypes = celltypes %>% generics::setdiff(celltype_de$de_output_tidy$cluster_id) %>% unique()
  if (length(excluded_celltypes) > 0) {
    print("excluded cell types are:")
    print(excluded_celltypes)
    print("These celltypes are not considered in the analysis. After removing samples that contain less cells than the required minimal, some groups don't have 2 or more samples anymore. As a result the analysis cannot be run. To solve this: decrease the number of min_cells or change your group_id and pool all samples that belong to groups that are not of interest! ")
  }
  if (length(excluded_celltypes) == length(celltypes)) {
    print("None of the cell types passed the check. This might be due to 2 reasons. 1) no cell type has enough cells in >=2 samples per group. 2) problem in batch definition: not all levels of your batch are in each group - Also for groups not included in your contrasts!")
  }

  hist_pvals = celltype_de$de_output_tidy %>% dplyr::inner_join(celltype_de$de_output_tidy %>% dplyr::group_by(contrast,cluster_id) %>% dplyr::count(), by = c("cluster_id","contrast")) %>% 
    dplyr::mutate(cluster_id = paste0(cluster_id, "\nnr of genes: ", n)) %>% dplyr::mutate(`p-value <= 0.05` = p_val <= 0.05) %>% 
    ggplot(aes(x = p_val, fill = `p-value <= 0.05`)) + 
    geom_histogram(binwidth = 0.05,boundary=0, color = "grey35") + scale_fill_manual(values = c("grey90", "lightsteelblue1")) + 
    facet_grid(contrast~cluster_id) + ggtitle("P-value histograms") + theme_bw()

that block ran without printing anything. I will send you those results asap, just wanted to check with my PI first, sorry

browaeysrobin commented 1 year ago

Hi @VivianQM

So that block of code also ran without errors? It is very weird that both blocks of code would run without errors, since those are the lines of code in get_DE_info. Are you sure get_DE_info is still giving errors?

I will send you those results asap, just wanted to check with my PI first, sorry

I understand (but just want to see console output, not the data objects of course)

VivianQM commented 1 year ago

Hi @VivianQM

So that block of code also ran without errors? It is very weird that both blocks of code would run without errors, since those are the lines of code in get_DE_info. Are you sure get_DE_info is still giving errors?

I will send you those results asap, just wanted to check with my PI first, sorry

I understand (but just want to see console output, not the data objects of course)

Yes it is still giving out this error: Warning message: “Unknown or uninitialised column: cluster_id.” [1] "excluded cell types are:" [1] "Granulosa" "Theca"
[1] "These celltypes are not considered in the analysis. After removing samples that contain less cells than the required minimal, some groups don't have 2 or more samples anymore. As a result the analysis cannot be run. To solve this: decrease the number of min_cells or change your group_id and pool all samples that belong to groups that are not of interest! " [1] "None of the cell types passed the check. This might be due to 2 reasons. 1) no cell type has enough cells in >=2 samples per group. 2) problem in batch definition: not all levels of your batch are in each group - Also for groups not included in your contrasts!" Error in dplyr::group_by(): ! Must group by variables found in .data. Column contrast is not found. Column cluster_id is not found. Traceback:

get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, . celltype_id = celltype_id, covariates = covariates, contrasts_oi = contrasts_oi, . min_cells = 10, findMarkers = TRUE)
celltype_de$de_output_tidy %>% dplyr::inner_join(celltype_de$de_output_tidy %>% . dplyr::group_by(contrast, cluster_id) %>% dplyr::count(), . by = c("cluster_id", "contrast")) %>% dplyr::mutate(cluster_id = paste0(cluster_id, . "\nnr of genes: ", n)) %>% dplyr::mutate(p-value <= 0.05 = p_val <= . 0.05) %>% ggplot(aes(x = p_val, fill = p-value <= 0.05))
ggplot(., aes(x = p_val, fill = p-value <= 0.05))
dplyr::mutate(., p-value <= 0.05 = p_val <= 0.05)
dplyr::mutate(., cluster_id = paste0(cluster_id, "\nnr of genes: ", . n))
dplyr::inner_join(., celltype_de$de_output_tidy %>% dplyr::group_by(contrast, . cluster_id) %>% dplyr::count(), by = c("cluster_id", "contrast"))
inner_join.data.frame(., celltype_de$de_output_tidy %>% dplyr::group_by(contrast, . cluster_id) %>% dplyr::count(), by = c("cluster_id", "contrast"))
auto_copy(x, y, copy = copy)
same_src(x, y)
same_src.data.frame(x, y)
is.data.frame(y)
celltype_de$de_output_tidy %>% dplyr::group_by(contrast, cluster_id) %>% . dplyr::count()
dplyr::count(.)
dplyr::group_by(., contrast, cluster_id)
group_by.data.frame(., contrast, cluster_id)
group_by_prepare(.data, ..., .add = .add, error_call = current_env())
abort(bullets, call = error_call)
signal_abort(cnd, .file)

Very strange indeed...

browaeysrobin commented 1 year ago

@VivianQM

get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, . celltype_id = celltype_id, covariates = covariates, contrasts_oi = contrasts_oi, . min_cells = 10, findMarkers = TRUE)

I see you have put findMarkers = TRUE here, can you set it to FALSE ? (probably will be the same error message, but you never know)

VivianQM commented 1 year ago

@VivianQM

get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, . celltype_id = celltype_id, covariates = covariates, contrasts_oi = contrasts_oi, . min_cells = 10, findMarkers = TRUE)

I see you have put findMarkers = TRUE here, can you set it to FALSE ? (probably will be the same error message, but you never know)

Same error. Sorry :(

ekawaler commented 1 year ago

I'm having the same problems - can send over my output from the get_DE_info() replacement function if that would help?

browaeysrobin commented 1 year ago

Hi @ekawaler

I'm having the same problems - can send over my output from the get_DE_info() replacement function if that would help?

Yes that would help. The more similarities I see between all your output that are not present in the output of the several datasets I tested on will help.

VivianQM commented 1 year ago

@VivianQM

get_DE_info(sce = sce, sample_id = sample_id, group_id = group_id, . celltype_id = celltype_id, covariates = covariates, contrasts_oi = contrasts_oi, . min_cells = 10, findMarkers = TRUE)

I see you have put findMarkers = TRUE here, can you set it to FALSE ? (probably will be the same error message, but you never know)

Email sent!

ekawaler commented 1 year ago

More info: when I run the same dataset using the pairwise wrapper it does work!

catsargent commented 1 year ago

Hi Robin,

I did this and it ended with Error in x[[1]] : subscript out of bounds.

I have emailed you the output.

Many thanks for looking into this!

Catherine

@catsargent

Everything looks fine at first sight. I have no idea what is going wrong and it's very hard when I cannot reproduce your errors myself.

To better see where exactly things might go wrong, can you run the following snippet of code in the place where you would normally run get_DE_info()?

celltypes = SummarizedExperiment::colData(sce)[,celltype_id] %>% unique()
contrasts = contrasts_oi

perform_muscat_de_analysis_test = function(sce, sample_id, celltype_id, group_id, batches, covariates, contrasts, assay_oi_pb = "counts", fun_oi_pb = "sum", de_method_oi = "edgeR", min_cells = 10, filterByExpr.min.count = 7, filterByExpr.min.total.count = 15, filterByExpr.large.n = 4, filterByExpr.min.prop = 0.7){
  # conditions of interest in the contrast should be present in the in the group column of the metadata
  groups_oi = SummarizedExperiment::colData(sce)[,group_id] %>% unique()
  conditions_oi = stringr::str_split(contrasts, "'") %>% unlist() %>% unique() %>%
    # stringr::str_split("[:digit:]") %>% unlist() %>% unique() %>%
    stringr::str_split("\\)") %>% unlist() %>% unique() %>%
    stringr::str_split("\\(") %>% unlist() %>% unique() %>%
    stringr::str_split("-") %>% unlist() %>% unique() %>%
    stringr::str_split("\\+") %>% unlist() %>% unique() %>%
    stringr::str_split("\\*") %>% unlist() %>% unique() %>%
    stringr::str_split("\\/") %>% unlist() %>% unique() %>% generics::setdiff(c("",","," ,", ", ")) %>% unlist() %>% unique()
  conditions_oi = conditions_oi[is.na(suppressWarnings(as.numeric(conditions_oi)))]

  # conditions of interest in the contrast should be present in the in the contrast_tbl
  contrasts_simplified = stringr::str_split(contrasts, "'") %>% unlist() %>% unique() %>%
    stringr::str_split(",") %>% unlist() %>% unique() %>% generics::setdiff(c("",",")) %>% unlist() %>% unique()

  # prepare sce for the muscat pseudobulk analysis
  sce$id = sce[[sample_id]]
  sce = muscat::prepSCE(sce,
                              kid = celltype_id, # subpopulation assignments
                              gid = group_id,  # group IDs (ctrl/stim)
                              sid = "id",   # sample IDs (ctrl/stim.1234)
                              drop = FALSE)  # drop all other SummarizedExperiment::colData columns ----------------- change to false
  print("sce after prepSCE")
  print(sce)

  pb = muscat::aggregateData(sce,
                             assay = assay_oi_pb, fun = fun_oi_pb,
                             by = c("cluster_id", "sample_id"))
  print("pb after aggregateData")
  print(pb)

  # prepare the experiment info (ei) table if batches present
  if(length(batches) > 1){
    batches_present = TRUE
  } else {
    if(!is.na(batches)){
      batches_present = TRUE
    } else {
      batches_present = FALSE

    }
  }

  if(batches_present){
    extra_metadata = SummarizedExperiment::colData(sce)  %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id), all_of(batches)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor)
  } else {
    extra_metadata = SummarizedExperiment::colData(sce) %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor) 
  } 
  if('sample_id' != sample_id){
    extra_metadata$sample_id = extra_metadata[[sample_id]]
  }
  ei = S4Vectors::metadata(sce)$experiment_info

  ei = ei %>%  dplyr::inner_join(extra_metadata, by = "sample_id")

  # prepare the experiment info (ei) table if covariates present
  if(length(covariates) > 1){
    covariates_present = TRUE
  } else {
    if(!is.na(covariates)){
      covariates_present = TRUE
    } else {
      covariates_present = FALSE

    }
  }

  if(covariates_present){
    extra_metadata = SummarizedExperiment::colData(sce)  %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id), all_of(covariates)) %>% dplyr::distinct() ## no factorization of the continuous covariates!
  } else {
    extra_metadata = SummarizedExperiment::colData(sce) %>% tibble::as_tibble() %>% dplyr::select(all_of(sample_id)) %>% dplyr::distinct() %>% dplyr::mutate_all(factor) 
  } 
  if('sample_id' != sample_id){
    extra_metadata$sample_id = extra_metadata[[sample_id]]
  }
  if(is.null(ei)){
    ei = S4Vectors::metadata(sce)$experiment_info
  }

  ei = ei %>%  dplyr::inner_join(extra_metadata, by = "sample_id")

  # prepare the design and contrast matrix for the muscat DE analysis

  if(batches_present == TRUE & covariates_present == FALSE){

    batches_string = paste0("ei$",batches) %>% paste(collapse = " + ")
    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", batches_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == FALSE & covariates_present == TRUE) {

    covariates_string = paste0("ei$",covariates) %>% paste(collapse = " + ")
    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", covariates_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == TRUE & covariates_present == TRUE) {

    batches_string = paste0("ei$",batches) %>% paste(collapse = " + ")
    covariates_string = paste0("ei$",covariates) %>% paste(collapse = " + ")

    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id + ", batches_string, " + ", covariates_string, " ) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id), make.names(colnames(design)[(length(levels(ei$group_id))+1):length(colnames(design))])) )

  } else if (batches_present == FALSE & covariates_present == FALSE) {

    design = eval(parse(text=paste("model.matrix(~ 0 + ei$group_id) ",sep="")))
    dimnames(design) = list(ei$sample_id, c(levels(ei$group_id)))

  }

  contrast = eval(parse(text=paste("limma::makeContrasts(", contrasts, ",levels=design)",sep="")))

  # filter genes
  celltype = SummarizedExperiment::assayNames(pb) %>% .[1]
  pb = pb[edgeR::filterByExpr(y = SummarizedExperiment::assay(pb, celltype), design = design, min.count = filterByExpr.min.count, min.total.count = filterByExpr.min.total.count, large.n = filterByExpr.large.n, min.prop = filterByExpr.min.prop), ]
  print("pb after gene filtering")
  print(pb)

  # run DS analysis
  res = muscat::pbDS(pb, method = de_method_oi , design = design, contrast = contrast, min_cells = min_cells, verbose = FALSE, filter = "none")
  print("res")
  print(head(res[[1]][[1]][[1]]))

  de_output_tidy = muscat::resDS(sce, res, bind = "row", cpm = FALSE, frq = FALSE) %>% tibble::as_tibble() %>% dplyr::rename(p_adj = p_adj.glb)
  print("de_output_tidy")
  print(head(de_output_tidy))

  output = list(de_output = res, de_output_tidy = de_output_tidy)
  return(output)
}
DE_list = celltypes %>% lapply(function(celltype_oi, sce){
  sce_oi = sce[, SummarizedExperiment::colData(sce)[,celltype_id] == celltype_oi]
  DE_result = perform_muscat_de_analysis_test(sce = sce_oi, 
                                     sample_id = sample_id, 
                                     celltype_id = celltype_id, 
                                     group_id = group_id, 
                                     batches = batches, 
                                     covariates = covariates, 
                                     contrasts = contrasts_oi)
}, sce)

Can you send me the output and error/warnings? (You can do that via e-mail if you would prefer that)

browaeysrobin commented 1 year ago

More info: when I run the same dataset using the pairwise wrapper it does work!

Hi @ekawaler: So you mean no errors are thrown when you follow the steps in: https://github.com/saeyslab/multinichenetr/blob/main/vignettes/pairwise_analysis_MISC.md ?

For @VivianQM and @catsargent: can you try the same as @ekawaler and run the steps in this vignette? Does it work that way as in @ekawaler's case?: https://github.com/saeyslab/multinichenetr/blob/main/vignettes/threewise_analysis_MISC.md

For @ekawaler @VivianQM @catsargent: Can you share your sessionInfo?

browaeysrobin commented 1 year ago

Hi @ekawaler @VivianQM @catsargent and others with this problem:

I adapted the get_DE_info function. The previous function threw the wrong error message sometimes (not only when cell types were really missing, but also when another error occurred during the DE analysis), this should be fixed now and allow us to better identify each of your problems.

You can try using the new function after reinstalling multinichenetr.

Please send me the new error message you get. If the error message would already be clear for you to fix, try to do so as well, and let me know if that fix worked.

catsargent commented 1 year ago

Ok thanks. Like @ekawaler, it also worked for me when using the wrapper function... using that it only complained about the actual cell types not meeting the criteria for DE analysis and then moved on using the cell types which did.

I shared my session_info in an email to you Robin.

VivianQM commented 1 year ago

@browaeysrobin Reinstalled, re-ran everything, and the error went away. HORAYYY! Thank you so much for your help and patience. And thanks to all who helped! Seems like everyone is doing fine, so I will go ahead and close this issue.