stefpeschel / NetCoMi

Network construction, analysis, and comparison for microbial compositional data
GNU General Public License v3.0

conditional dependence #58

Closed gc26762524 closed 1 year ago

gc26762524 commented 1 year ago

Hello NetCoMi developer,

Please allow me to open this issue. After extensive reading and searching in recent days, I feel this might be the best place to ask this question.

We have a manuscript in revision in which we constructed two networks (case and control) and then made some basic comparisons of the common and differing edges and node pairs. We received the following reviewer comment: "For the whole network analysis unless the datasets really are the same size - the same number of samples, and the same set of features - then the presence of an edge in one network but not the other is risky to conclude anything from. I would instead propose to build the network from all samples and add an interaction term between edge and disease status, then look for edges where this interaction term is significant. Either way, I would need more certainty that this part is not subject to artifacts before I can assess the network results."

My questions are

  1. Do you agree that "the same number of samples" is a prerequisite for a fair network comparison (assuming, of course, that the two networks have the same domain/nature)? The groups in the NetCoMi examples (amgut_split$no and amgut_split$yes) do not have the same number of samples either, nor do those in many other published network comparison papers.
  2. I agree with the reviewer that a single network could be built first and the conditions compared afterwards. I wonder whether we could instead use the netConstruct() function to build two networks and then do the comparison following the TUTORIAL_createAssoPerm tutorial (or with netCompare()?) to address the "interaction" issue. Is this basically what you referred to in your paper as "conditional dependence"? I am not quite clear on the term.

Any suggestions would be greatly appreciated. Thank you very much in advance.

Cheng

stefpeschel commented 1 year ago

Dear Cheng,

Thank you for reaching out to us.

Constructing a single network and including the disease status would be one option for assessing the association between microbes and the disease status. However, a network comparison between cases and controls, or between two environmental states, is a valid approach to find microbes that are differentially associated between two groups and to reveal differences in the network structure. This approach is common practice; see, e.g., Sommer et al. [2022]. Comparing two networks makes it possible, for instance, to assess whether microbes cluster differently between the groups, or whether keystone taxa (expressed by hub nodes) differ.

The sample size issue is a legitimate point, which we mention in the discussion of our manuscript. Fisher [1921] has shown that the sample correlations $r$ vary around the true, unobserved correlation coefficient $\rho$ with a variance that depends on the sample size as well as on $\rho$ (see, e.g., Figures 1 and 2 in Fisher [1921]).
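For orientation, the dependence on sample size can be written down with the standard large-sample approximations for Pearson correlations under bivariate normality. These formulas are textbook results added here for illustration, not quoted from the thread; $n$ denotes the number of samples in a group and $z$ the Fisher-transformed correlation.

$$
\operatorname{Var}(r) \approx \frac{(1-\rho^2)^2}{n}, \qquad z = \operatorname{artanh}(r), \qquad \operatorname{Var}(z) \approx \frac{1}{n-3}.
$$

Both expressions shrink as $n$ grows, so the smaller group always has the noisier correlation estimates.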

Here is an example of why this is problematic: Imagine the data set is split into two groups and the true population correlations are close to 0 for most taxa pairs. According to Fisher [1921], the sample correlations are approximately normally distributed if $\rho$ is close to 0. Let the sample size of group 1 be considerably smaller than that of group 2; then the variance of the estimated correlations is larger in group 1. If we chose a threshold of 0.3 on the absolute correlation, more taxa pairs would be expected to be connected in the network of group 1, simply because of the greater variance of its estimates. This may lead to spurious results when network properties are compared between the two groups.
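The effect described above can be quantified with a minimal sketch. It assumes a true correlation of exactly 0 and uses the standard Fisher z approximation (variance of roughly 1/(n - 3)); the function names and the sample sizes 20, 50, and 200 are hypothetical choices for illustration, not values from this thread or from NetCoMi.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def spurious_edge_probability(n: int, threshold: float = 0.3) -> float:
    """Approximate P(|r| > threshold) when the true correlation is 0.

    Assumption: atanh(r) is roughly normal with mean 0 and
    variance 1 / (n - 3) (Fisher z approximation).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(threshold) * math.sqrt(n - 3)
    return 2.0 * (1.0 - normal_cdf(z))


# Hypothetical group sizes: the same threshold behaves very differently.
for n in (20, 50, 200):
    print(n, round(spurious_edge_probability(n), 4))
```

With a fixed threshold, the probability of a purely spurious edge falls sharply as the group size increases, which is why the smaller group tends to end up with the denser network.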

This issue can be addressed to some extent by using Student’s t-test as the sparsification method. However, for considerably different sample sizes, especially if one group is particularly small, the sample size is still an issue.
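To see why a significance-based cutoff helps only partly, the following sketch computes the usual t statistic for a Pearson correlation and an approximate critical correlation per group size. The critical value uses a normal (Fisher z) approximation rather than exact t quantiles, and both function names are hypothetical; this is not how NetCoMi implements its t-test sparsification.

```python
import math
from statistics import NormalDist


def t_statistic(r: float, n: int) -> float:
    """Standard t statistic for testing rho = 0 with n samples."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def approx_critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is significant at level alpha (two-sided).

    Assumption: normal approximation via Fisher's z, i.e.
    atanh(r) * sqrt(n - 3) is compared with a normal quantile.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z_crit / math.sqrt(n - 3))


# Hypothetical group sizes: the cutoff adapts to n, unlike a fixed threshold.
for n in (20, 200):
    print(n, round(approx_critical_r(n), 3))
```

The cutoff adapts to the group size, which controls spurious edges in the small group. The price is that a weak but real correlation can pass in the large group and fail in the small one, so the two networks still differ in power.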

The interplay between association estimates, normalization, and sample size is examined and well illustrated in Badri et al. [2020]. The authors show that shrinkage leads to association estimates that are nearly independent of sample size, which is necessary for the estimates to be comparable between groups. Since shrinkage is not yet implemented in NetCoMi, (nearly) equal sample sizes would be important for a reliable network comparison.

As for your second question, I would like to clarify two terms:

The “createAssoPerm” tutorial is not related to your issue. It explains how to estimate permutation associations outside netConstruct(), which might be useful for large data sets.

Best, Stefanie

References:

gc26762524 commented 1 year ago

Hi Stefanie,

Thank you so much for the clarification and explanation. I have read through all the information and reference materials you provided and now have a good sense of your answer.

Again, thank you. I will continue to use your fantastic tool and recommend NetCoMi to my friends. Wish you a good day there.

Best, Cheng

indrikwijaya commented 8 months ago

Hi,

Sorry to reopen this issue, but I'm wondering how we could address the issue of having very different sample sizes between the 2 groups. Do we then subsample the larger group so that the number of samples in the 2 groups is equal? And would the sampleSize parameter in netConstruct() then be equal to the size of the smaller group?

stefpeschel commented 8 months ago

Hi, yes, you can randomly subsample from the larger group. The sampleSize argument expects a vector with one or two elements. If the sample sizes are equal, you can give just a single value, which will be used for both groups.
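The random subsampling step might look like the sketch below. It works on sample identifiers only and is language-agnostic in spirit; the function name, the seed, and the example IDs are hypothetical, and the selected samples would still have to be passed to netConstruct() in R afterwards.

```python
import random


def subsample_larger_group(ids_a, ids_b, seed=123456):
    """Randomly reduce the larger group to the size of the smaller one.

    Returns two lists of sample IDs with equal length. The original
    order of the retained IDs is preserved; a fixed seed keeps the
    draw reproducible.
    """
    rng = random.Random(seed)
    target = min(len(ids_a), len(ids_b))

    def draw(ids):
        ids = list(ids)
        if len(ids) == target:
            return ids
        keep = sorted(rng.sample(range(len(ids)), target))
        return [ids[i] for i in keep]

    return draw(ids_a), draw(ids_b)


# Hypothetical sample IDs for a small and a large group.
cases = [f"case_{i}" for i in range(12)]
controls = [f"control_{i}" for i in range(40)]
cases_sub, controls_sub = subsample_larger_group(cases, controls)
print(len(cases_sub), len(controls_sub))
```

Because a single draw discards part of the larger group, repeating the subsampling with several seeds and checking whether the comparison results are stable may be worthwhile.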