meringlab / FlashWeave.jl

Inference of microbial interaction networks from large-scale heterogeneous abundance data

relationship about parameters heterogeneous=true and make_sparse #36

Closed huizhen-yan closed 4 months ago

huizhen-yan commented 9 months ago

Hi, I have two questions about the parameters of FlashWeave. Q1: According to the learn_network() help page, "make_sparse - use a sparse data representation (should be left at true in almost all cases)", so make_sparse should almost always be set to true. However, I observed that even with make_sparse=true, sparse is still reported as false in the Run information when heterogeneous=false. When heterogeneous=true, on the other hand, sparse is automatically true. If the sparse setting is bound to heterogeneous, what is the use of make_sparse for non-heterogeneous data?

Q2: My OTU matrix (seawater) has only about 100 samples, far fewer than the thousands mentioned on the help page, so I set heterogeneous=false. The degree distribution of the resulting network approximates a Poisson distribution, similar to a random network. With heterogeneous=true, however, the node degrees follow a power-law distribution. I would therefore like to know what the basic requirements for heterogeneous data are. The following three figures show the degree distributions of the networks generated with different parameters/methods.

[Three figures: degree distributions of the networks generated with the different parameters/methods]

Can you help me?

jtackm commented 8 months ago

Hi Yan,

Regarding Q1: Good catch! make_sparse should only be forced for sensitive=true + heterogeneous=false, since the adaptive clr-normalization removes zeros anyway. For other settings, it seems the preprocessing currently defaults to producing sparse data sets (which is usually optimal for performance and memory) and ignores manual overrides. That is something to fix! But note that the generated networks should be the same; the setting only affects performance.

Regarding Q2: While heterogeneous=true is more power-hungry (hence the >1k samples rule of thumb, but this is generous), the more important aspect is how many structural zeros you expect to be present. For datasets that combine very different environments (say soil and marine), structural zeros are expected to dominate the signal and can lead to spurious inferences if not accounted for; heterogeneous=true helps with this. If your data is quite homogeneous, however (single habitats with a low expected fraction of structural zeros), heterogeneous=false may provide more sensitivity. Unfortunately, there are no hard rules here. The difference in degree distribution is indeed curious, and it would be interesting to get to the core of this. My first guess would be pseudo-count-induced normalization artifacts (heterogeneous=true avoids pseudo-counts), but it's hard to know without digging deeper.
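
To make the pseudo-count guess more concrete, here is a minimal Python sketch of the two normalization styles being contrasted. It is not FlashWeave code. The function names, the fixed pseudo-count of 1.0 (FlashWeave's version is described above as adaptive), and the choice to leave zeros as 0.0 in the second variant are all assumptions made for illustration.

```python
import math

def clr_pseudocount(counts, pseudo=1.0):
    """clr after adding a pseudo-count to every entry, zeros included.

    Assumption: a fixed pseudo-count; this is a simplification, not
    FlashWeave's adaptive scheme.
    """
    logs = [math.log(c + pseudo) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_nonzero(counts):
    """clr computed over the non-zero entries only.

    Assumption: zeros are left as 0.0 placeholders (treated as absent)
    instead of being replaced by a pseudo-count.
    """
    logs = [math.log(c) for c in counts if c > 0]
    if not logs:
        return [0.0] * len(counts)
    mean_log = sum(logs) / len(logs)
    return [math.log(c) - mean_log if c > 0 else 0.0 for c in counts]

# Hypothetical sample: one row of an OTU table with many zeros.
sample = [120, 0, 30, 0, 0, 6]
print(clr_pseudocount(sample))
print(clr_nonzero(sample))
```

In the pseudo-count variant, every zero becomes the same negative value, and the number of zeros in a sample shifts the values of all other entries through the geometric mean. In the non-zero variant, zeros do not enter the mean at all. Whether this accounts for the differing degree distributions is untested here.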