skojaku / degree-corrected-link-prediction-benchmark

Link prediction
MIT License

Proof of AUC-ROC -> 1 for heterogeneous networks #66

Closed by skojaku 5 months ago

skojaku commented 5 months ago

Inspired by the derivation by @MunjungKim, I did my own exploration. Any holes?

Setting:

Question:

Let us denote by $(i,j)$ the positive edge, whose endpoint degrees are drawn from the excess degree distribution $p_1(k)$, and by $(x,y)$ the negative edge, whose endpoint degrees are drawn from the degree distribution $p(k)$. Does $P(s_{ij} \geq s_{xy})\rightarrow 1$ as $\alpha\rightarrow 2$?

Some potentially useful properties

Power-law degree distribution

$$p(k) = \frac{\alpha-1}{k_{\text{min}}} k^{-\alpha}$$ Let's assume that $k_{\text{min}}=1$ to simplify.

Cumulative distribution

$$\Phi(k) = 1-\int_k ^\infty p(k')\;{\rm d}k' = 1-(\alpha-1)\cdot\frac{k^{-\alpha+1}}{\alpha-1} = 1-k^{-\alpha +1} $$

Average degree

$$\langle k \rangle = \int_1^\infty k\, p(k)\;{\rm d}k = (\alpha-1)\int_1^\infty k^{-\alpha+1}\;{\rm d}k = \frac{\alpha-1}{\alpha-2}$$

Excess degree distribution

$$ p_1(k) = \dfrac{1}{\int k'p(k')\;{\rm d}k'}kp(k) = \frac{\alpha-2}{\alpha-1} kp(k) = (\alpha-2)k^{-\alpha+1} $$
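As a quick numerical sanity check, the properties above can be verified by inverse-transform sampling. This is a sketch assuming the continuous power-law with $k_{\text{min}}=1$; the value $\alpha = 3.5$ is an arbitrary choice:

```python
import numpy as np

# Sanity check of the power-law properties via inverse-transform sampling.
# Inverting the CDF Phi(k) = 1 - k^{1-alpha} gives k = u^{-1/(alpha-1)}, u ~ U(0,1).
rng = np.random.default_rng(0)
alpha, n = 3.5, 1_000_000
k = rng.random(n) ** (-1 / (alpha - 1))

# Average degree: <k> = (alpha-1)/(alpha-2)
print(k.mean(), (alpha - 1) / (alpha - 2))
# Cumulative distribution at k = 2: Phi(2) = 1 - 2^{1-alpha}
print(np.mean(k <= 2), 1 - 2 ** (1 - alpha))
```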

Derivation

Let us consider a subset of $(i,j)$ and $(x,y)$ that yield $s_{ij} \geq s_{xy}$. We focus on the case where $k_i \geq k_x$ AND $k_j \geq k_y$, which gives a lower bound, i.e.,

$$ P(k_x \leq k_i) \cdot P(k_y \leq k_j) \leq P(s_{xy} \leq s_{ij}) $$

Let us focus on $P(k_x \leq k_i)$, i.e.,

$$
\begin{align}
P(k_x\leq k_i) &= \int p_1(k_i) \left(\int_0 ^{k_i} p(k_x)\,{\rm d}k_x\right){\rm d}k_i \\
&= \int p_1(k_i)\, \Phi(k_i)\,{\rm d}k_i \\
&= (\alpha-2)\int k_i^{-\alpha + 1} \left(1-k_i^{-\alpha+1}\right){\rm d}k_i \\
&= (\alpha-2) \left[ \int k_i^{-\alpha + 1}\,{\rm d}k_i -\int k_i^{-\alpha + 1}k_i^{-\alpha+1}\,{\rm d}k_i \right] \\
&= (\alpha-2) \left[ \int k_i^{-\alpha + 1}\,{\rm d}k_i -\int k_i^{-2\alpha + 2}\,{\rm d}k_i \right] \\
&= (\alpha-2) \left[ \frac{1}{\alpha-2} -\frac{1}{2\alpha-3} \right] \\
&= 1 -\frac{\alpha-2}{2\alpha-3}
\end{align}
$$
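A Monte Carlo check of this closed form, as a sketch under the same continuous approximation with $k_{\text{min}} = 1$ ($\alpha = 3$ is an arbitrary choice):

```python
import numpy as np

# Check P(k_x <= k_i) = 1 - (alpha-2)/(2*alpha-3) by sampling.
rng = np.random.default_rng(0)
alpha, n = 3.0, 1_000_000

# k_x ~ p(k): invert the CDF Phi(k) = 1 - k^{1-alpha}
k_x = rng.random(n) ** (-1 / (alpha - 1))
# k_i ~ p_1(k): invert the excess-degree CDF 1 - k^{2-alpha}
k_i = rng.random(n) ** (-1 / (alpha - 2))

empirical = np.mean(k_x <= k_i)
analytical = 1 - (alpha - 2) / (2 * alpha - 3)
print(empirical, analytical)  # both close to 2/3 for alpha = 3
```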

Notice that $P(k_x\leq k_i) = 1$ when $\alpha = 2$. What about $P(k_y \leq k_j)$? Because $k_y$ and $k_j$ are independent of $k_x$ and $k_i$, we have

$$ P(k_x \leq k_i) = P(k_y \leq k_j). $$

Thus, we obtain:

$$ \left( 1 -\frac{\alpha-2}{2\alpha-3} \right)^2 \leq P(s_{xy} \leq s_{ij}). $$

Therefore, $P(s_{xy} \leq s_{ij})\rightarrow 1$ as $\alpha \rightarrow 2$.

[Screenshot 2024-05-10 at 4:24:16 PM]
skojaku commented 5 months ago

The lower bound is obviously very loose. It falls below 0.5 for homogeneous networks ($\alpha > 3$), which should not be the case. So it would be nicer if we could tighten the bound. It would also be interesting to compare against the numerical results by @rachithaiyappa for confirmation.
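To compare the bound against simulation, here is a minimal sketch that estimates $P(s_{xy} \leq s_{ij})$ directly by sampling endpoint degrees under the same continuous power-law setup (no Metropolis-Hastings needed here, since the CDFs invert in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_auc(alpha, n=500_000):
    """Estimate P(s_xy <= s_ij), with s = product of the endpoint degrees."""
    # positive edge: both endpoint degrees from the excess distribution p_1(k)
    k_i = rng.random(n) ** (-1 / (alpha - 2))
    k_j = rng.random(n) ** (-1 / (alpha - 2))
    # negative edge: both endpoint degrees from p(k)
    k_x = rng.random(n) ** (-1 / (alpha - 1))
    k_y = rng.random(n) ** (-1 / (alpha - 1))
    return np.mean(k_x * k_y <= k_i * k_j)

for alpha in [2.1, 2.5, 3.0, 3.5]:
    lower_bound = (1 - (alpha - 2) / (2 * alpha - 3)) ** 2
    print(alpha, estimate_auc(alpha), lower_bound)
```

The estimate should always sit above the analytical lower bound, and the gap shows how loose the bound is for larger $\alpha$.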

rachithaiyappa commented 5 months ago

Cool! Just a small note for correctness (it should not change anything in your derivation since you've assumed $k_{\text{min}}=1$): $p(k) = \frac{\alpha-1}{k_{\text{min}}}\left(\frac{k}{k_{\text{min}}}\right)^{-\alpha}$.

rachithaiyappa commented 5 months ago

[image]

@skojaku This is what I have so far. The orange curve is what we discussed in the morning (sampling $s^- \sim (p(k))^2$ and $s^+ \sim (kp(k))^2$)

The blue curve is obtained by keeping $k_i$ and $k_j$ separate, i.e., $s_{ij}^- \sim p(k_i)\,p(k_j)$ and $s_{ij}^+ \sim k_i p(k_i) \cdot k_j p(k_j)$.

I just performed one run of the Metropolis-Hastings algorithm. I was unable to get it to produce meaningful samples for larger values of $\alpha > 2.5$. I think it's probably because my initial conditions aren't ideal. I need to explore a little more on that front, but this looks promising?

skojaku commented 5 months ago

Wow. It's very different. It seems to align with my results. Let's see it with more samples.

MunjungKim commented 5 months ago

Wow, this is so awesome, Sada!! 🤩🙌👏 I will also explore ways to tighten the bound :)

yy commented 5 months ago

Yeah, I think the independent condition that both degrees are larger is much more restrictive than the product condition. I'm wondering whether we can do a change of variable $k \rightarrow \log k$ to turn the product into a sum and then use generating functions to obtain the solution. I may try.

yy commented 5 months ago

Does this line of thought make sense?

  1. We change the variable of the distribution to obtain distributions of $\log k$ instead of $k$ (something like $(\alpha - 2) e^{-(\alpha - 2)l}$).
  2. We get two generating functions $G_1(x)$ and $G_0(x)$, one for each of the two distributions of $\log k$.
  3. $G_1^2(x)$ and $G_0^2(x)$ now generate $\log s_{ij}$ and $\log s_{xy}$. The condition $s_{ij} \geq s_{xy}$ implies that $\log s_{ij} = \log k_i + \log k_j \ge \log k_x + \log k_y = \log s_{xy}$.
  4. Let $\mathcal{N}$ denote the sorted sequence of all possible values of $\log k_1 + \log k_2$, where $k_1$ and $k_2$ are positive integers.
  5. Let's rewrite $G_1^2(x)$ as $\tilde{A}(x) = \sum_{i} a_i x^{\mathcal{N}_i}$ and $G_0^2(x)$ as $\tilde{B}(x) = \sum_i b_i x^{\mathcal{N}_i}$.
  6. The probability we want to calculate, $P(s_{ij} \geq s_{xy})$, can be written as the following sum: $S = a_0 b_0 + a_1(b_0 + b_1) + \dots + a_n \sum_{i \le n} b_i + \dots$
  7. Given that the actual values of the exponents don't really matter (only their order matters), we can also define $A(x) = \sum_i a_i x^i$ and $B(x) = \sum_i b_i x^i$. The calculation of $S$ is the same for $A(x)$ as for $\tilde{A}(x)$, and likewise for $B(x)$ and $\tilde{B}(x)$.
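The sum in step 6 is just $S = \sum_n a_n \sum_{i \le n} b_i$, which is easy to evaluate once the coefficient sequences are known. A toy illustration with made-up coefficients (not derived from any actual $G_1$ or $G_0$):

```python
import numpy as np

# Toy coefficients for A(x) and B(x), aligned on the same sorted exponent
# sequence N; these are illustrative, not derived from generating functions.
a = np.array([0.1, 0.2, 0.3, 0.4])  # P(log s_ij = N_n), sums to 1
b = np.array([0.4, 0.3, 0.2, 0.1])  # P(log s_xy = N_n), sums to 1

# S = sum_n a_n * (b_0 + ... + b_n)
S = float(np.sum(a * np.cumsum(b)))
print(S)  # 0.85
```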
skojaku commented 5 months ago

Sounds promising 👀. I could follow up to step 5, but how did you get the sum in step 6? I think it's derived from the probability generating function of $s_{ij} - s_{xy}$, but I'm wondering how you represent the inequality $s_{ij} - s_{xy} \geq 0$.

yy commented 5 months ago

Oh, it's just that when we have $s_{ij} = 1$ ($a_0$), the only possibility is $s_{xy} = 1$ ($b_0$) and the probability is $a_0 b_0$. Then for the next term of $s_{ij}$ ($a_1$), we can only have $b_0$ and $b_1$, and so on. So $S$ simply counts every possible term. I don't know how to calculate it with the generating functions. 🙈

skojaku commented 5 months ago

I see. I think it's possible if the generating functions $G^2_1$ and $G^2_0$ have a simple analytical form. We cannot assume a power law, as its generating function does not exist. But I think the log-normal is promising?

[image]

skojaku commented 5 months ago

Oh, the log-normal distribution is continuous, so it may not be directly applicable... But assuming that the degree distribution is log-normal, $\log k_i$ is Gaussian. We can probably compute the probability generating function for $S$.

skojaku commented 5 months ago

Result for log-normal degree distribution

I derived the exact, closed-form expression of the AUC-ROC for the log-normal degree distribution:

$$ \text{AUC-ROC} = 1 - \int_{-\infty} ^{\infty} \text{Norm}(z \mid 0, 1)\, \Phi\left( z -\sqrt{2}\sigma\right) {\rm d}z $$

where $\text{Norm}$ is the probability density function of the (standard) normal distribution, and $\Phi$ is its cumulative function.

It is an increasing function of the variance parameter $\sigma^2$ of the log-normal distribution, meaning AUC-ROC increases as the node degree is more heterogeneously distributed.
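The closed form can be checked numerically. The sketch below evaluates the integral by quadrature and compares it against a direct Monte Carlo estimate, assuming $\log k \sim \mathcal{N}(0, \sigma^2)$ and using the fact that size-biased sampling of a log-normal shifts the log-mean by $\sigma^2$ (the value $\sigma = 1.5$ is an arbitrary choice):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

sigma = 1.5

# Quadrature: 1 - integral of Norm(z|0,1) * Phi(z - sqrt(2)*sigma) dz
auc_formula = 1 - quad(
    lambda z: stats.norm.pdf(z) * stats.norm.cdf(z - np.sqrt(2) * sigma),
    -np.inf, np.inf,
)[0]

# Monte Carlo: log-degrees of negative-edge endpoints ~ N(0, sigma^2);
# positive-edge endpoints are size-biased, i.e. ~ N(sigma^2, sigma^2)
rng = np.random.default_rng(0)
n = 1_000_000
log_s_pos = rng.normal(sigma**2, sigma, n) + rng.normal(sigma**2, sigma, n)
log_s_neg = rng.normal(0.0, sigma, n) + rng.normal(0.0, sigma, n)
auc_mc = np.mean(log_s_neg <= log_s_pos)

print(auc_formula, auc_mc)  # the two estimates should agree to ~2-3 decimals
```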

Derivation

(See my note; it is also included in the main text as an appendix.) auc-roc-log-normal.pdf

Some TODO tasks

I have some TODO tasks to complete the analysis. Would anyone like to take over?

Check the derivation

If you have time to check my derivation, it would be greatly appreciated!

Numerical simulations

I assumed that the degree distribution (a discrete distribution) follows the log-normal (a continuous distribution). This assumption is conventional, though I'm not sure how good the approximation is.

It would be good to show this analytical result is valid using the simulation results, where the AUC-ROC is computed numerically by generating log-normally distributed random variables.

Find the $\sigma$ values for the empirical network

My results show that the AUC-ROC approaches 1 as $\sigma$ increases. But it would be good to demonstrate this with the empirical $\sigma$ values from the data we have. I have code to fit the log-normal distribution:

```python
import numpy as np
from scipy import stats

# lognorm.fit returns (shape, loc, scale), where shape = sigma and scale = exp(mu)
sigma, loc, scale = stats.lognorm.fit(deg, floc=0)
mu = np.log(scale)
```
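A quick check of the fit on synthetic data (a hypothetical example; in practice `deg` would be an empirical degree sequence). Note that SciPy parameterizes the log-normal with `scale = exp(mu)`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
deg = rng.lognormal(mean=1.0, sigma=1.2, size=100_000)  # synthetic "degrees"

sigma, loc, scale = stats.lognorm.fit(deg, floc=0)
mu = np.log(scale)
print(sigma, mu)  # should recover sigma ~ 1.2 and mu ~ 1.0
```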
skojaku commented 5 months ago
[Screenshot 2024-05-18 at 5:32:48 AM]

Expected: based on the analytical results. Observed: with $\sigma$ values estimated by the method of moments.

yy commented 5 months ago

Awesome!