Heterogeneous dataset subsetting and number of cells/cluster

Hello,

I am using CellOracle to analyse a 10x multiome (scRNA-seq + ATAC-seq) dataset. I will preface this post by saying that my analysis so far has only included GRN generation, not in silico TF perturbation. I don't have a continuous "developmental" population for most of my clusters, so the goal of my analysis is not to use pseudotime, in silico TF perturbation, etc., but merely to generate cluster-specific GRNs to identify key regulators.

I have three questions:

1. My dataset includes a heterogeneous mix of cell types from a pluripotent stem cell (PSC) differentiation. For instance, I have a mesoderm population and a haematopoietic one; although they come from the same PSC differentiation, they are clearly very different cell types. Does it make sense to calculate the set of variable genes (and the GRN) on the whole dataset, or should I subset it by cell type and run the variable gene selection (and downstream analysis) separately, e.g. one set for mesoderm, one set for haematopoietic cells, etc.? Similarly, I calculate the "linked" ATAC peaks that feed into the base GRN with Signac (instead of Cicero), using the whole dataset. Would it be better to run Signac/Cicero only on the subset of clusters representing one cell type, e.g. mesoderm, so that only mesoderm-specific ATAC peaks are included?
2. In a preliminary analysis, I generated cluster-specific GRNs using the whole dataset. When I filtered the GRNs by betweenness centrality and assessed the expression of the top 30 GRN hits in the relevant clusters, I noticed that some TFs are barely expressed in the cell clusters whose GRNs they appear in. See the attached plot for an example: HAND1 is barely expressed in Cell cluster 9, and the same holds for BHLHE40 in Cell cluster 1, TFEC in Cell cluster 2, etc. What is the reason for this? Also, is there a way to filter the GRN to only include genes expressed in >X% of cells in that cluster?
3. If I were to subset my dataset to only include a single cell type, e.g. only mesoderm, and run the downstream analysis only on this subset (variable genes, "linked" ATAC peaks, GRN, etc.), is there a minimum number of cells required for the GRN to be accurate? I am mostly worried about dropouts, which will have a larger effect on clusters with lower cell numbers. This is a particular issue for scATAC-seq, where most open chromatin sites are not detected in any single cell, and pseudobulking of more cells is required to get clean tracks.
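
To make the filter I am asking about in question 2 concrete, here is a minimal sketch of what I have in mind, written with plain pandas rather than any CellOracle function. Everything in it is an assumption for illustration: the helper names `fraction_expressing` and `filter_links_by_expression` are hypothetical, the GRN is assumed to be available as an edge table with `source` and `target` columns, and the counts and cluster labels are toy values, not my data.

```python
import pandas as pd


def fraction_expressing(counts, clusters):
    """Fraction of cells in each cluster with a non-zero count for each gene.

    counts:   cells x genes DataFrame of raw counts (hypothetical layout)
    clusters: Series of cluster labels with the same index as counts
    Returns a clusters x genes DataFrame of fractions between 0 and 1.
    """
    detected = (counts > 0).astype(float)
    return detected.groupby(clusters).mean()


def filter_links_by_expression(links, frac, cluster, min_frac=0.1):
    """Keep edges whose source TF and target gene are both detected in
    more than min_frac of the cells of the given cluster."""
    expressed = set(frac.columns[frac.loc[cluster] > min_frac])
    keep = links["source"].isin(expressed) & links["target"].isin(expressed)
    return links[keep].reset_index(drop=True)


# Toy data: six cells, two clusters, three genes.
counts = pd.DataFrame(
    {
        "HAND1": [0, 0, 0, 5, 3, 4],
        "GATA1": [2, 1, 3, 0, 0, 1],
        "ACTB": [9, 8, 7, 9, 6, 8],
    },
    index=[f"cell{i}" for i in range(6)],
)
clusters = pd.Series(["9", "9", "9", "1", "1", "1"], index=counts.index)

# Toy edge table; the column names are an assumption about the GRN layout.
links = pd.DataFrame(
    {
        "source": ["HAND1", "GATA1", "GATA1"],
        "target": ["ACTB", "ACTB", "HAND1"],
    }
)

frac = fraction_expressing(counts, clusters)
filtered = filter_links_by_expression(links, frac, cluster="9", min_frac=0.5)
print(filtered)
```

In this toy example HAND1 is not detected in any cell of cluster 9, so every edge involving it is dropped from that cluster's table. Whether the threshold should apply to the source TF only or to both ends of the edge is an open choice; the sketch applies it to both.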

Thank you for your time!

morris-lab / CellOracle, issue #197