smorabit / hdWGCNA

High dimensional weighted gene co-expression network analysis
https://smorabit.github.io/hdWGCNA/

Advice for the workflow, cell type subset analysis, pseudobulk, and running on large datasets #258

Open MaximilianNuber opened 2 weeks ago

MaximilianNuber commented 2 weeks ago

Dear Dr. Morabito,

Thank you for the nice package, it is exactly what I needed. I have a few questions, mainly to make sure I am using hdWGCNA correctly, as I don't understand it deeply yet.

  1. As I am analyzing a big dataset, I subset the Seurat object to a single cell type before SetupForWGCNA, similar to the PBMC example in the Cell publication. The goal is to perform differential expression between conditions within the entire cell type, but also between clusters within the cell type. Since your tutorials mention that clusters with few cells may be excluded in MetacellsByGroups, for cell type X I add the subclusters to group.by:

    seurat_obj <- MetacellsByGroups(
      seurat_obj = seurat_obj,
      group.by = c("louvain_clusters", "dataset"), # dataset is my subject variable
      reduction = "PCA",
      k = 25,
      max_shared = 10,
      ident.group = 'louvain_clusters'
    )

    My thought was that the additional grouping variable ensures the relevant subclusters are kept in the data. Is that correct? I have 40k to 60k cells in one cell type.

  2. When I subset my dataset to a celltype and then start with hdWGCNA, would that be equivalent to SetupForWGCNA on the entire dataset and subsetting in SetDatExpr?

  3. As I mentioned, the goal for me would be to use the differential expression to find differentially expressed modules. For each cell type I follow the "hdWGCNA in single-cell data" tutorial and then perform differential expression as per the "Differential module eigengene (DME) analysis" tutorial. Did I miss something?

  4. In the "hdWGCNA in single-cell data" tutorial, in the section on SetDatExpr, you mention that either the metacell expression matrix or the single-cell expression matrix can be used, e.g. for SCTransformed data. But then, what would be the point of the metacells?

  5. In the PBMC example of your publication you mention that 50 cells per metacell were used. Did you adjust the option target_metacells to arrive at that, or is there another option I missed? And does computation speed increase or decrease with more cells per metacell?

  6. Naturally, I tried running hdWGCNA on my entire dataset of about 500k cells, which was slow. In your publication, you mention that the metacell aggregation on >550k T-cells took about 85 min. For my dataset (with all cell types) it was not done after 9 hours, and it crashed overnight. Can this be explained by the added complexity of having more cell types than just T-cells?

  7. I was wondering why MetacellsByGroups takes so long. Do you happen to know what the bottleneck is? (My first guess was FNN::knn.index.)

  8. If I choose to use pseudobulk in hdWGCNA (I have >60 samples), does the workflow change?

  9. If I have single-cell data and bulk RNA data per cell type, would it make sense to pseudobulk the single-cell data and use a consensus workflow within your package?

My apologies for the number of questions, I am just getting to know your package and WGCNA. Some of my questions may already be answered in the tutorials or the publication, but as a biologist by training I sometimes have difficulty following every detail. If that is the case, my apologies for that, too.

Thank you already for any help and the nice tool.

Best, Max

smorabit commented 2 weeks ago

Hi,

Thank you for your interest in hdWGCNA. It seems like your questions are somewhat related to each other, but in the future it is much easier if you open multiple smaller issues, as we request in the issue template.

My thought was that the additional grouping variable keeps the relevant subclusters in the data for sure. Therefore the question, would that be correct? I have 40k to 60k in one cell type.

In MetacellsByGroups, the parameter group.by ensures that metacells will be created from that particular grouping of cells.
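
One practical check before calling MetacellsByGroups is to count the cells in each group.by combination, since crossing subclusters with the subject variable shrinks every group, and the tutorials note that groups with few cells may be excluded. The following is a minimal base-R sketch under stated assumptions: the metadata is made up and stands in for seurat_obj@meta.data, and the threshold of 100 cells is an assumed value for the min_cells argument that should be checked against the installed version.

```r
# Toy metadata standing in for seurat_obj@meta.data (all values are made up).
set.seed(1)
meta <- data.frame(
  louvain_clusters = sample(c("c0", "c1", "c2"), 3000, replace = TRUE,
                            prob = c(0.60, 0.35, 0.05)),
  dataset = sample(paste0("subj", 1:4), 3000, replace = TRUE)
)

# Count cells in every cluster-by-subject combination, i.e. the grouping
# that group.by = c("louvain_clusters", "dataset") would define.
group_sizes <- as.data.frame(table(meta$louvain_clusters, meta$dataset))
names(group_sizes) <- c("louvain_clusters", "dataset", "n_cells")

# Assumed threshold; compare with min_cells in your hdWGCNA version.
min_cells <- 100
group_sizes$too_small <- group_sizes$n_cells < min_cells

# Smallest groups first, so groups at risk of being skipped are easy to spot.
print(group_sizes[order(group_sizes$n_cells), ])
```

If a subcluster of interest falls below the threshold in most subjects, grouping by the subcluster alone, or lowering the threshold, may be worth considering.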

When I subset my dataset to a celltype and then start with hdWGCNA, would that be equivalent to SetupForWGCNA on the entire dataset and subsetting in SetDatExpr?

The network and modules will likely be the same, but if you do not subset prior to SetupForWGCNA then you will also be able to perform downstream analysis on the other cell types. Whether or not to subset before SetupForWGCNA depends on what kind of questions you want to answer with your co-expression network analysis, and personally I have done it both ways.

As I mentioned, the goal for me would be to use the differential expression to find differentially expressed modules. For each cell type I follow the "hdWGCNA in single-cell data" tutorial and then perform differential expression as per the "Differential module eigengene (DME) analysis" tutorial. Did I miss something?

Yes, this makes sense.

In the "hdWGCNA in single-cell data" tutorial, in the section on SetDatExpr, you mention that either the metacell expression matrix or the single-cell expression matrix can be used, e.g. for SCTransformed data. But then, what would be the point of the metacells?

You can use the single-cell expression matrix if you want. We recommend using metacells or pseudobulk for network construction, but we wanted to give users the option of using the single-cell matrix.
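
For context on why aggregation is recommended: single-cell count matrices are very sparse, and pooling similar cells reduces that sparsity before correlations are computed. The toy base-R sketch below illustrates only this effect. It is a simplification and not the hdWGCNA procedure: the counts are simulated, and cells are pooled into disjoint blocks of k cells, whereas metacells are built from k-nearest neighbors and may share cells.

```r
set.seed(42)
n_genes <- 200
n_cells <- 1000
k <- 25  # cells pooled per aggregate, mirroring k in MetacellsByGroups

# Simulated sparse counts (genes x cells); lambda is an arbitrary choice.
counts <- matrix(rpois(n_genes * n_cells, lambda = 0.1), nrow = n_genes)

# Simplification: assign cells to disjoint blocks of k cells instead of
# using overlapping k-nearest-neighbor sets.
block <- rep(seq_len(n_cells / k), each = k)

# rowsum() sums rows by group, so work on the cells x genes transpose.
aggregated <- t(rowsum(t(counts), block))

zero_frac_cells <- mean(counts == 0)
zero_frac_aggregated <- mean(aggregated == 0)

cat("fraction of zeros, single cells:", round(zero_frac_cells, 3), "\n")
cat("fraction of zeros, aggregated:  ", round(zero_frac_aggregated, 3), "\n")
```

The aggregated matrix has far fewer zero entries, which is the property that makes correlation-based network construction more stable on metacells or pseudobulk than on raw single cells.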

Naturally, I tried running hdWGCNA on my entire dataset of about 500k cells, which was slow. In your publication, you mention that the metacell aggregation on >550k T-cells took about 85 min. For my dataset (with all cell types) it was not done after 9 hours, and it crashed overnight. Can this be explained by the added complexity of having more cell types than just T-cells?

I don't have any explanation for how long it takes to run on any particular machine, and I am not sure if speculating would be helpful.

I was wondering why MetacellsByGroups takes so long. Do you happen to know what the bottleneck is?

I am not sure what the bottleneck is.

If I choose to use pseudobulk in hdWGCNA, does the workflow change (>60 samples)?

Please refer to this tutorial for pseudobulk hdWGCNA.

If I have single cell data, and bulk RNA data per cell type, would it make sense to pseudobulk the single cell data and use a consensus workflow within your package?

This is up to you. I cannot make a recommendation without knowing the experimental design and the biological questions.