pinellolab / dictys

Context specific and dynamic gene regulatory network reconstruction and analysis
GNU Affero General Public License v3.0

Question regarding tissue level GRN inference #70

Open PauBadiaM opened 2 days ago

PauBadiaM commented 2 days ago

Hi @lingfeiwang and @nikostrasan,

I want to apply Dictys to infer a static tissue-level GRN rather than multiple cell-type-specific ones, as I am working with data that has already reached a steady state and I do not expect sufficient variability within each cell group. I have considered two strategies:

1) Infer footprints for each cell type, assign them to my peaks, then summarize the scores across cell types using the mean to obtain a single scaffold GRN, which is then fed into the modeling step.
2) Infer footprints and model a GRN for each cell type individually, and then combine the resulting GRNs (summarizing repeated edges using the mean).
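To make the two strategies concrete, here is a toy sketch of strategy 1 as plain pandas operations. The matrix shapes, names, and values are made up for illustration; this is not the Dictys API, only the summarization idea:

```python
# Hypothetical sketch of strategy 1: average per-cell-type footprint
# score matrices into one scaffold. Names and shapes are assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tfs, peaks = ["TF1", "TF2"], ["peak1", "peak2", "peak3"]

# One TF-by-peak footprint score matrix per cell type (toy values).
scores_per_celltype = {
    ct: pd.DataFrame(rng.random((len(tfs), len(peaks))), index=tfs, columns=peaks)
    for ct in ["Bcell", "Tcell", "Mono"]
}

# Summarize scores across cell types (mean here; max is another option)
# to obtain a single scaffold fed into the modeling step.
scaffold = pd.concat(scores_per_celltype).groupby(level=1).mean()
print(scaffold.shape)  # one averaged TF-by-peak matrix
```

Strategy 2 would instead run the full modeling per cell type and apply the same kind of mean over the resulting edge-weight matrices.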

Which approach do you think would work best with Dictys?

Many thanks!

lingfeiwang commented 2 days ago

Hi Pau,

That's a great question.

First of all, I have heard the argument of insufficient variability from several people, but I have not encountered it myself. Actually, (co)variability is the only source of information from which we can draw statistical conclusions from the transcriptome at steady state, and even weak (co)variability signals can be observed with sufficient cells. For example, single-cell co-expression networks typically contain meaningful biology even when the cells come from a single, fully differentiated cell type in a single sample. Could you elaborate on why you do not expect sufficient variability within each cell group, if the issue is not simply too few cells? If you have a reference, e.g. one claiming that variability is typically insufficient for GRN inference among cells from the same cell type/state and condition, that would be really helpful too.

Having said that, several other choices arise when inferring tissue-level GRNs. For example, how should scores be summarized: mean, weighted mean, or max? Should we use all cell types or select some, e.g. based on cell count? In principle, we cannot fully answer your question, or these ones, without comprehensive benchmarking. Even for an educated guess, we would ideally want to know what your tissue-level GRN really means (definition), how you would typically use it (application), and whether cell count is a bottleneck.

Best, Lingfei

PauBadiaM commented 2 days ago

Hi @lingfeiwang,

Thanks for the reply! While variability between cell types is large, variability within individual cell types tends to be limited. This has implications for GRN inference with regression-based methods, which rely on variation across observations. My worry is that low within-cell-type variability might obscure some cell-type-specific regulatory interactions, particularly in typical single-cell atlases (cells not undergoing development).

Here is an example scheme taken from my thesis (still in writing) showing this issue:

[schematic image]

Blue cells (imagine they are B cells) have a specific marker TF (say, PAX5) that regulates specific B-cell marker genes (IG genes, for example). When all the other cell types are included, there is enough variability to find a significant trend, and PAX5 will be included in the GRN. However, when we include only B cells, no trend is found, since all B cells express PAX5 (with its expression magnitude being essentially random).

This is counterintuitive for what "cell-type-specific" inference is trying to do, since marker TFs (such as PAX5 in B cells) will not be recovered. This only applies to clusters that have reached steady state; in real trajectories, such as development, there will be enough inter-cluster variability to find a trend.
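The scenario above can be mimicked with a toy simulation (illustrative only; all numbers are made up): across cell types the TF-target trend is clear, but within the marker-positive cluster the TF is uniformly on and the trend vanishes.

```python
# Toy simulation of the PAX5-style scenario: pooled regression across
# cell types finds a strong trend; within-cluster regression finds none.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)

# Other cell types: TF off, target off (plus noise).
tf_other = rng.normal(0.1, 0.05, 300).clip(0)
tg_other = rng.normal(0.1, 0.05, 300).clip(0)

# B-cell-like cluster: TF and target both uniformly high, with random
# magnitudes and no within-cluster coupling.
tf_b = rng.normal(3.0, 0.3, 300)
tg_b = rng.normal(3.0, 0.3, 300)

pooled = linregress(np.r_[tf_other, tf_b], np.r_[tg_other, tg_b])
within = linregress(tf_b, tg_b)
print(f"pooled slope={pooled.slope:.2f} p={pooled.pvalue:.1e}")
print(f"within slope={within.slope:.2f} p={within.pvalue:.2f}")
```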

The question then is: how do we obtain cell-type-specific GRNs? One option is to subset the tissue-level GRN by inferring TF activities at the cluster level and filtering for significance (with decoupler's ulm, or any preferred enrichment score that provides a p-value).
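For readers unfamiliar with ulm, here is a minimal sketch of the idea (a univariate linear model, as in decoupler): regress a cluster's expression profile on a TF's regulon weights and use the fit's significance as the activity call. Gene counts and weights below are invented for illustration; this is not the decoupler implementation itself.

```python
# Minimal ulm-style TF activity sketch: slope of expression ~ regulon
# weights, with its p-value deciding whether the TF stays in the
# cluster-specific GRN. All numbers are toy values.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
n_genes = 200

# Tissue-level GRN weights of one TF over all genes (0 = not a target).
tf_weights = np.zeros(n_genes)
tf_weights[:20] = 1.0  # 20 hypothetical target genes

# Cluster pseudobulk in which this TF's targets are up-regulated.
expr = rng.normal(0, 1, n_genes)
expr[:20] += 2.0

res = linregress(tf_weights, expr)
print(f"activity slope={res.slope:.2f}, p={res.pvalue:.1e}")
# Keep the TF's edges for this cluster only if p passes a threshold.
```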

Regarding the strategies for summarizing scores, benchmarking is necessary, but I was wondering whether you had an intuition about which of the two could work best. For now I am leaning towards 1), since more observations will be included in the regression modeling, but I am open to suggestions based on your experience.

lingfeiwang commented 1 day ago

Hi Pau,

Thank you for the really nice illustration and your detailed clarification.

However, this exact scenario is what we (the causal inference community) have been trying to avoid for decades. Please find a counterexample at https://csinva.io/notes/assets/confounding_ex.png.

Generally speaking, we can indeed introduce extra variability, but doing so also introduces or aggravates confounding. We can always perform a regression analysis, but unlike in causal inference, a large coefficient does not indicate a true gene regulation. An alternative explanation is that a confounder not included in your model separately controls both the TF and the target gene's expression, and thereby inflates their covariability. As you probably know, many other mechanisms can act as confounders and control gene expression, such as non-TF genes and epigenetics, and their variability can differ hugely across cell types. On the other hand, from a Waddington landscape perspective, each terminally differentiated cell type or similar steady state sits in its own local minimum. It is questionable whether linear models can well characterize the complex landscape encompassing multiple cell types/local minima.
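The confounding caveat can also be shown with a toy simulation (illustrative only): a hidden factor, e.g. a cell-type program, drives both the TF and the target, so pooled regression reports a large coefficient even though the TF has no direct effect.

```python
# Toy confounder demo: Z drives both TF and target; naive regression
# finds a strong TF->target slope, residualizing on Z removes it.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(3)
z = rng.normal(0, 1, 1000)              # hidden confounder (e.g. cell type)
tf = 2 * z + rng.normal(0, 0.5, 1000)   # TF expression driven by Z
tg = 2 * z + rng.normal(0, 0.5, 1000)   # target driven by Z, NOT by the TF

naive = linregress(tf, tg)
print(f"naive slope={naive.slope:.2f}")      # large despite no direct edge

# Controlling for Z (residualizing both variables on it).
tf_r = tf - np.polyval(np.polyfit(z, tf, 1), z)
tg_r = tg - np.polyval(np.polyfit(z, tg, 1), z)
adjusted = linregress(tf_r, tg_r)
print(f"adjusted slope={adjusted.slope:.2f}")  # near zero
```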

I also question whether in real data we actually see variability versus no variability, or rather strong versus weak variability. Weak variability can often be addressed with more single-cell data; if it cannot, that should be justified with evidence, which unfortunately I have never seen. Personally, I would prefer to report weak variability as a lack of statistical significance rather than turn it into a false positive. Ultimately, it comes down to benchmarking to understand whether the cost of the confounders is worth the increased sensitivity from the extra variability.

Regarding your particular question about PAX5: because it is only expressed at a particular stage of B-cell development, I would regard using only cells at that stage as appropriate. The variability can be weak, or equivalently the number of cells limited; however, I imagine a serious biological investigation would expand its data collection effort or enrich for these cells. Similarly, I would expect computational method development or benchmarking studies to focus on TFs that function in the most abundant cell types. It is a shame that people are often expected to do whatever analysis is needed from the data given to them. I do not know whether you fall into either category, but if you have to do it, given that the bottleneck is cell count (in the transcriptome), I would suggest using strategy 1 and trying both max and mean summarization, given the TF's transient functioning. However, I would always caution that this is not the intended use, and you should interpret the results carefully, ideally with follow-up validation experiments.

Apologies for this opinionated reply. As a researcher and educator, I have never fully understood the spread of ideas without evidence or theory. I would not hesitate to acknowledge ignorance; that makes for a healthier community than circulating suspicions that may mislead people.

Happy to follow up if you have further questions.

Lingfei