smorabit / hdWGCNA

High dimensional weighted gene co-expression network analysis
https://smorabit.github.io/hdWGCNA/

number of metacells created far larger than expected #27

Closed Brawni closed 1 year ago

Brawni commented 1 year ago

Hello! Nice package! I can see this analysis framework being extremely useful and widely used for single-cell data. I'm puzzled by the number of metacells that are generated from my dataset using MetacellsByGroups.

srt_wgcna$project = 'project'
srt_wgcna <- MetacellsByGroups (
  seurat_obj = srt_wgcna,
  group.by = c('orig.ident' , 'project'), # specify the columns in seurat_obj@meta.data to group by
  k = 50, # nearest-neighbors parameter
  max_shared = 0,
  ident.group = 'project'
)
> srt_wgcna
An object of class Seurat 
33619 features across 24526 samples within 1 assay 
Active assay: RNA (33619 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap
> GetMetacellObject(srt_wgcna)
An object of class Seurat 
33619 features across 9960 samples within 1 assay 
Active assay: RNA (33619 features, 0 variable features)

How is it possible that I get 9960 metacells from 24526 cells if each metacell contains 50 cells with no overlap?
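
The arithmetic behind this question can be sketched in a few lines of base R. This is only an illustration: it assumes every metacell contains exactly k = 50 distinct cells and that max_shared = 0 means no cell is reused, and the variable names are made up for the example.

```r
n_cells  <- 24526  # cells in srt_wgcna, from the printed Seurat object
k        <- 50     # cells per metacell (the k argument above)
observed <- 9960   # metacells reported by GetMetacellObject()

# Upper bound on the number of metacells if no cell is shared between them
max_disjoint <- n_cells %/% k
max_disjoint  # 490

# Average number of metacells each cell would have to belong to
# in order to produce the observed count
mean_memberships <- observed * k / n_cells
mean_memberships
```

Under these assumptions, at most 490 fully disjoint metacells are possible, so 9960 metacells implies that each cell is reused about 20 times on average.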

Thanks!

smorabit commented 1 year ago

Hi,

To clarify, the max_shared argument is new to this function, and I am aware that it does not work as expected yet. I wanted to add a feature to this function to control the amount of overlap between different metacells. I started to work on a solution but it isn't finished yet; my apologies for the confusion. I have tagged this issue as an enhancement and will post an update soon.

smorabit commented 1 year ago

Hi again,

The max_shared argument is now working as expected. Please update to the newest version of hdWGCNA and run this step again. I also wanted to mention that having 0 overlap is not a great idea for this analysis, since you will likely have too few data points to work with when constructing the co-expression network. Based on your issue, I added a new parameter to MetacellsByGroups called target_metacells, which lets you select the target number of metacells to generate for each grouping. Note that the target will not always be reached if there aren't enough metacells that meet the overlap criterion set with max_shared.
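
To make the interaction between max_shared and target_metacells concrete, here is a toy sketch in base R. It is not the hdWGCNA implementation: the function select_metacells, the greedy acceptance rule, and the sliding-window candidates are all assumptions made up for illustration. It only shows why a target can go unmet when the overlap limit is strict.

```r
# Illustrative sketch only; NOT how hdWGCNA selects metacells internally.
# Each candidate metacell is a vector of cell indices.
select_metacells <- function(candidates, max_shared, target_metacells) {
  accepted <- list()
  for (cand in candidates) {
    # Stop once the requested number of metacells has been reached
    if (length(accepted) >= target_metacells) break
    # Number of cells this candidate shares with each accepted metacell
    shared <- vapply(accepted, function(a) length(intersect(a, cand)), integer(1))
    # Accept only if no accepted metacell shares more than max_shared cells
    if (length(shared) == 0 || max(shared) <= max_shared) {
      accepted[[length(accepted) + 1]] <- cand
    }
  }
  accepted
}

# Toy data: 20 cells; candidates are sliding windows of 5 consecutive cells
candidates <- lapply(1:16, function(i) i:(i + 4))

n_strict <- length(select_metacells(candidates, max_shared = 0, target_metacells = 10))
n_loose  <- length(select_metacells(candidates, max_shared = 2, target_metacells = 10))
n_capped <- length(select_metacells(candidates, max_shared = 2, target_metacells = 3))

n_strict  # 4: only four disjoint windows fit in 20 cells, so the target is not reached
n_loose   # 6: allowing 2 shared cells admits more metacells, still below the target
n_capped  # 3: the target is reached and selection stops early
```

In this toy setting, loosening max_shared increases the number of usable metacells, while target_metacells only acts as an upper limit.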

Brawni commented 1 year ago

That was quick! Is ~500 metacells not enough observations? I have only 1 cell type, comprising 24526 cells from a bunch of samples, each containing at least 1000 cells. The samples are also very homogeneous, so I don't care too much about representativeness. By the way, I've seen this strategy of overlapping metacells in other instances, but shouldn't you prioritize independent observations over sample size? What are your thoughts?

Thanks!

smorabit commented 1 year ago

> Is ~500 metacells not enough observations? I have only 1 cell type, comprising 24526 cells from a bunch of samples, each containing at least 1000 cells. The samples are also very homogeneous, so I don't care too much about representativeness.

So 500 target metacells was the default that I chose, which works reasonably well when you are grouping the dataset by both biological sample and cell type. It may be too small in your case, where you have only one cell type and you're allowing grouping across different samples. I am actually not sure of the minimum number of metacells that would still give a reasonable result in the downstream co-expression network analysis. In your case you may have to try a few different values.

> By the way, I've seen this strategy of overlapping metacells in other instances, but shouldn't you prioritize independent observations over sample size? What are your thoughts?

There are a few tradeoffs here. During the metacell construction process, we are trying to reduce the sparsity of the dataset while maintaining cellular heterogeneity, and the metacell aggregation approach seems to capture a range of cell states while reducing sparsity by about 10-fold, depending on your k parameter. In terms of what counts as an "independent observation" in single-cell data, I would argue that different cells coming from the same biological replicate are not necessarily independent observations, since these cells are all part of the same biological system. This is why we suggest constructing metacells within a given sample, rather than making metacells that can draw cells from more than one sample.
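
The sparsity point can be illustrated with a small base R simulation. This is a sketch under loud assumptions: the counts are simulated from a Poisson distribution, and cells are grouped in consecutive blocks of k rather than by nearest neighbors as in the actual metacell procedure, so the numbers it produces say nothing about any real dataset.

```r
set.seed(1)
n_genes <- 200
n_cells <- 1000
k       <- 50  # cells aggregated into each toy metacell

# Simulated sparse count matrix (genes x cells); low-mean Poisson counts
counts <- matrix(rpois(n_genes * n_cells, lambda = 0.05), nrow = n_genes)

# Assign cells to blocks of k (a stand-in for KNN neighborhoods)
group <- rep(seq_len(n_cells / k), each = k)

# Sum counts within each block; rowsum() groups rows, hence the transposes
metacell_counts <- t(rowsum(t(counts), group))

# Fraction of zero entries before and after aggregation
sparsity <- function(m) mean(m == 0)
sparsity_cells     <- sparsity(counts)
sparsity_metacells <- sparsity(metacell_counts)

c(cells = sparsity_cells, metacells = sparsity_metacells)
```

Summing over more cells (a larger k) leaves fewer zero entries but also fewer aggregated observations, which is the tradeoff described above.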

In general, different parameter choices will work better or worse depending on the input dataset (true of many packages, including Seurat), and it is really up to the user to decide what makes sense for their system.

Brawni commented 1 year ago

Thanks for your comments!