shaido987 / riskloc

Implementation of RiskLoc, a method for localizing multi-dimensional root causes.
MIT License

question of surprise in adtributor #4

Closed davidlight2018 closed 2 years ago

davidlight2018 commented 2 years ago

The calculation of the surprise value in adtributor does not seem correct to me.

The JS divergence formula should be:

[image: JS divergence formula]

So, the code should be:

p = df['predict'] / F
q = df['real'] / A
m = (p + q) / 2
df['surprise'] = 0.5 * np.sum(p * np.log(p/m)) + 0.5 * np.sum(q * np.log(q/m))

What do you think? Thanks.

shaido987 commented 2 years ago

This will give all leaf elements the same surprise regardless of their individual values, because `np.sum` collapses each term to a single scalar, so it would not be very useful (`df['surprise']` will have the same value for all rows). If you take a look at the original Adtributor paper [1], the surprise for an element is computed as:

[image: surprise formula from the Adtributor paper]
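
Since the formula above is only available as an image, here is a written-out version. It is reconstructed from memory of the paper, so the notation is an assumption and should be checked against [1]: $F_{ij}$ and $A_{ij}$ denote the forecast and actual values of element $j$ in dimension $i$, and $F$ and $A$ are the corresponding totals. The key point is that there is no sum over elements inside the definition, so each element gets its own value:

```latex
p_{ij} = \frac{F_{ij}}{F}, \qquad q_{ij} = \frac{A_{ij}}{A}, \qquad
S_{ij} = \frac{1}{2}\left( p_{ij} \log\frac{2\,p_{ij}}{p_{ij} + q_{ij}}
       + q_{ij} \log\frac{2\,q_{ij}}{p_{ij} + q_{ij}} \right)
```

Summing $S_{ij}$ over all elements $j$ of a dimension gives the full JS divergence between the forecast and actual distributions, which is the single number the snippet in the question computes.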

For the code implementation, in adtributor the sum over leaf elements is done within the for loop to obtain the total surprise for an element, i.e. here:

for d in dimensions:
    elements = df.groupby(d).sum()
    elements = elements.sort_values('surprise', ascending=False)
    ...

For adtributor_new the sum is done at the beginning. Only one element is therefore considered at a time, and there is no need to sum again (it has already been done).
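
To make the difference concrete, here is a minimal sketch on made-up data. It is not the repository code: the toy values and the `country` and `device` dimensions are assumptions for illustration, and only the `predict`, `real` and `surprise` column names follow the snippets above. It compares the scalar from the question with the per-leaf surprise and then sums the per-leaf values for each dimension, as the loop does.

```python
import numpy as np
import pandas as pd

# Toy leaf-level data (hypothetical values, all strictly positive to avoid log(0)).
df = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE'],
    'device': ['pc', 'mobile', 'pc', 'mobile'],
    'predict': [40.0, 30.0, 20.0, 10.0],
    'real': [10.0, 30.0, 20.0, 10.0],
})

F = df['predict'].sum()  # total forecast
A = df['real'].sum()     # total actual
p = df['predict'] / F
q = df['real'] / A
m = (p + q) / 2

# Variant from the question: np.sum reduces each term to a scalar, so
# assigning this to a column would give every row the same value.
js_total = 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# Element-wise variant: one surprise value per leaf element.
df['surprise'] = 0.5 * (p * np.log(p / m) + q * np.log(q / m))

# Summing leaf surprises per dimension gives the surprise of each element
# (selecting the column avoids summing the string columns).
per_dimension = {
    d: df.groupby(d)['surprise'].sum().sort_values(ascending=False)
    for d in ['country', 'device']
}

print(df[['country', 'device', 'surprise']])
for d, elements in per_dimension.items():
    print(elements)
```

The per-leaf values add up to the scalar from the question, so no information is lost; the sum is just deferred until the elements of a dimension are known.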

[1]: Bhagwan, Ranjita, et al. "Adtributor: Revenue debugging in advertising systems." 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), 2014.

davidlight2018 commented 2 years ago

Thanks, very helpful!

shaido987 commented 2 years ago

No problems :)