Closed: marquesarthur closed this issue 3 years ago.
As a follow-up, my current assumption is that the PageRank algorithm is the source of the nondeterminism, since s defines the probability of hopping to another link.
Another issue might arise from while np.sum(np.abs(r-ro)) > maxerr and time.time() - start_time < 60
On a machine with low resources, the algorithm might stop before convergence.
import logging
import time

import numpy as np

logger = logging.getLogger(__name__)

def pageRank(G, s=0.85, maxerr=0.001, max_time=120):
    """
    Args
    ----------
    G: matrix representing state transitions.
        Gij can be a boolean or a non-negative real number representing the
        transition weight from state i to j. In terms of votes or links, Gij
        is the number of votes/links from i to j, therefore increasing the
        pagerank of node j.

    Kwargs
    ----------
    s: probability of following a transition; 1-s is the probability of
        teleporting to another state. Defaults to 0.85.
    maxerr: if the sum of absolute pagerank differences between iterations
        is below this, we have converged. Defaults to 0.001.
    max_time: wall-clock budget in seconds. Defaults to 120.
    """
    n = G.shape[0]  # number of nodes, since G is a square matrix

    # transform G into the Markov matrix M
    M = np.array(G, dtype=float)  # np.float was removed from recent NumPy
    rsums = np.array(M.sum(1))    # sum of each row
    for i in range(M.shape[0]):   # divide each row by its row sum
        if rsums[i]:
            M[i, :] /= rsums[i]

    # boolean array of sink states (nodes with no outgoing links)
    sink = rsums == 0

    start_time = time.time()
    # compute pagerank r until we converge or exhaust the time budget;
    # the same max_time is used in the loop condition and in the log
    # check below, so the two can no longer disagree (the original
    # compared against 60 in the loop but 120 in the check)
    ro, r = np.zeros(n), np.ones(n)  # previous and current rank vectors
    while np.sum(np.abs(r - ro)) > maxerr and time.time() - start_time < max_time:
        ro = r.copy()
        # calculate each pagerank one node at a time
        for i in range(n):
            # inlinks of state i (column i of M)
            Ii = np.array(M[:, i])
            # account for sink states
            Si = sink / float(n)
            # account for teleportation to state i
            Ti = np.ones(n) / float(n)
            r[i] = ro.dot((Ii + Si) * s + Ti * (1 - s))

    if time.time() - start_time >= max_time:
        logger.info('PageRank -- finished due to max allowed time')

    # return normalized pagerank
    return r / sum(r)
After increasing the time limit to 120 seconds, summaries across runs are more likely to have similar sentences (screenshot: https://user-images.githubusercontent.com/1497480/96021861-db7d1d80-0e04-11eb-9904-39ed3a6cf7f4.png).
Please close the issue if this analysis seems right.
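One way to make runs reproducible, sketched here as a suggestion rather than the repository's actual code, is to replace the wall-clock cutoff with a fixed iteration cap, so every machine performs the same number of updates. The vectorized update below is intended to be algebraically the same as the per-node loop above (same s, sink, and teleportation terms):

```python
import numpy as np

def pagerank_fixed(G, s=0.85, maxerr=0.001, max_iter=1000):
    """Deterministic PageRank: iterate until convergence or max_iter,
    never consulting the wall clock."""
    n = G.shape[0]
    M = np.array(G, dtype=float)
    rsums = M.sum(axis=1)
    sink = rsums == 0                         # nodes with no outgoing links
    nonzero = ~sink
    M[nonzero] /= rsums[nonzero][:, None]     # row-normalize non-sink rows

    r = np.ones(n) / n                        # start from the uniform vector
    for _ in range(max_iter):
        ro = r.copy()
        # vectorized form of the per-node update:
        # r = s * (inlink mass + sink mass spread uniformly) + (1-s)/n
        r = s * (M.T.dot(ro) + ro[sink].sum() / n) + (1 - s) / n
        if np.sum(np.abs(r - ro)) <= maxerr:
            break
    return r / r.sum()
```

Because the stopping rule depends only on the matrix and the iteration count, two machines of very different speeds produce the same rank vector, which should remove the run-to-run variation in the summaries.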
Hi, it's been quite a while since I last worked on that code, but yes, the use of a time limit in PageRank is the only thing that should produce nondeterministic results. I wonder what machine you are running this on, or how big the bug report is. It doesn't seem like size is the problem, so maybe it's the machine?
Cheers and good luck :)
Rafael
Thanks for the fast response. Yeah, it's an old macOS machine with 8mb, plus a lengthy bug report, plus Docker eating a lot of my memory. I'll run the code on the uni server and review it.
I've been digging through replication packages for a while, and I really appreciate all your work/code. It was so easy to Plug'n'Play :-)
Hi,
First, I would like to thank you for making the code publicly available. This is really helpful. I am trying to extend summbug to GitHub. So far I have had great success: I got the code to run and produce summaries. I am just trying to understand why the code produces different outputs every time I run it for the same bug ID.
Consider https://github.com/flutter/flutter/issues/31598. The raw text for this issue is: (1266 characters)
A first summary has the following sentences: (240 characters, which is close to the 25% character threshold)
While a second summary has 270 characters:
So my question is why does the code produce different summaries if it is fed the same sentences?
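The suspected mechanism can be illustrated with a small, hypothetical example (this is not the repository's code, and the 3-node matrix below is made up): cutting the power iteration off at different points, as a wall-clock limit would on machines of different speeds, yields different rank vectors, and can even change which node ranks first.

```python
import numpy as np

def power_iteration(M, s=0.85, n_iter=50):
    # plain power iteration on a row-stochastic matrix M
    n = M.shape[0]
    r = np.ones(n) / n
    for _ in range(n_iter):
        r = s * M.T.dot(r) + (1 - s) / n
    return r / r.sum()

# a small row-stochastic transition matrix (no sink nodes)
M = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [1.0, 0.0, 0.0]])

early = power_iteration(M, n_iter=2)   # "slow machine": cut off early
late = power_iteration(M, n_iter=50)   # "fast machine": near convergence
print(np.abs(early - late).max())      # difference is noticeably non-zero
```

If the summarizer then keeps the top-ranked sentences, a rank vector that differs between runs directly explains summaries that differ between runs.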
I understand that it has been 6 years since the code was published, and I will continue debugging and trying to understand it on my own. Nonetheless, I wonder if I missed something obvious that could explain the difference between summaries 1 and 2.
Thanks once again