useful-forks / useful-forks.github.io

Improving GitHub's Forks list discoverability through automatic filtering. The project offers an online tool and a Chrome extension.
https://useful-forks.github.io/
MIT License
1.18k stars 61 forks

Duplicate results #59

Open pepa65 opened 1 year ago

pepa65 commented 1 year ago

Not sure if this is to be expected somehow, but I found nothing in the Issues about this: I'm getting lots of duplicate results displayed, the same lines with the same values. When exported to CSV, the duplicates are there as well.

For example, the file useful-forks-https\ __github.com_zyedidia_micro_.csv has 104 lines (including the header), but with duplicate lines removed, it only has 59.

payne911 commented 1 year ago

Good catch. For posterity, here's the result CSV content of the repo this user is talking about: https://justpaste.it/8evui

Worth noting: the CSV export code is not responsible. The result table itself contains duplicates. Subset example from the unfinished scan (this particular entry actually appears 3 times in the full table): (screenshot)

And if I look at the Network tab in the Chrome DevTools, it seems like GitHub's API itself is returning duplicates: (screenshot)

The simplest fix to this is to keep a HashSet of all entries added in the table, and to only append new rows if the repo isn't in the HashSet. Interestingly, this filter could happen earlier, potentially saving a few network calls (i.e. avoiding even requesting the fork's information if it already appears in the table). If someone feels like implementing that, I'll gladly review the PR.
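The proposed filter could be sketched roughly like this (hypothetical names, not the actual codebase; a plain lookup object keyed by the fork's `full_name` serves as the "seen" set):

```javascript
// Hypothetical sketch of the proposed duplicate filter. A lookup object
// keyed by the fork's "owner/repo" full name lets us skip forks we have
// already seen, ideally before fetching any further details for them.
const seen = {};

function shouldProcessFork(fork) {
  // `full_name` is the "owner/repo" string returned by GitHub's API.
  if (seen[fork.full_name]) {
    return false; // duplicate: skip the row (and the extra API calls)
  }
  seen[fork.full_name] = true;
  return true;
}

const forks = [
  { full_name: 'alice/micro' },
  { full_name: 'bob/micro' },
  { full_name: 'alice/micro' }, // duplicate returned by the API
];
const toProcess = forks.filter(shouldProcessFork);
console.log(toProcess.length); // 2
```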

I'll label this issue as a bug, despite the bug not being in our codebase.

Ethkuil commented 1 year ago

The simplest fix to this is to keep a HashSet of all entries added in the table, and to only append new rows if the repo isn't in the HashSet.

An even better fix is to change TABLE_DATA into a set instead of an array.

payne911 commented 1 year ago

to change TABLE_DATA into a set instead of an array.

Theoretically, a set has no ordering, so sorting would make no sense on that data structure.

Also, it seems like JavaScript has no built-in set (and so I regret saying "The simplest fix to this is to keep a HashSet"). We'd have to write one ourselves, and at that point it might not be worth it: the extra code increases complexity and reduces maintainability while not providing much of a performance gain, because we are talking about relatively small tables.

The simplest approach would be to scan the entire table array for a duplicate before appending: that's an acceptable O(n).
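That linear scan might look like this (a sketch; `appendRowIfNew` and the row shape are illustrative, only `TABLE_DATA` is the name mentioned in this thread):

```javascript
// Sketch of the linear-scan approach: before appending a row, check
// whether TABLE_DATA already contains an entry with the same repo name.
// O(n) per insertion, which is fine for tables of this size.
const TABLE_DATA = [];

function appendRowIfNew(row) {
  const isDuplicate = TABLE_DATA.some(r => r.name === row.name);
  if (!isDuplicate) {
    TABLE_DATA.push(row);
  }
  return !isDuplicate; // true if the row was actually added
}

appendRowIfNew({ name: 'alice/micro', stars: 12 });
appendRowIfNew({ name: 'alice/micro', stars: 12 }); // ignored
console.log(TABLE_DATA.length); // 1
```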

payne911 commented 1 year ago

Update: I have the fix ready. I'm simply waiting on #58 to get merged in as it'll build on it.

payne911 commented 1 year ago

Odd. This issue is still happening, but not consistently. (screenshot)

There's most probably a race condition happening. Could be in the Octokit library, or straight in my code. I'll have to bisect manually.

Or maybe my JavaScript knowledge is too narrow for me to even find the source of this bug. ¯\_(ツ)_/¯

payne911 commented 1 year ago

https://github.com/useful-forks/useful-forks.github.io/commit/229d0ce308dc47795d7323af9b0a128a0ef7f4ec

This was what I thought would fix it.

andy0130tw commented 1 year ago

Is this the set you want?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set
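For reference, the built-in `Set` (available since ES2015) would make the membership check trivial while the table itself stays an array, so sorting keeps working. A hedged sketch with hypothetical names:

```javascript
// JavaScript's built-in Set (ES2015) handles the membership test.
// The table remains an ordinary array (so sorting still makes sense);
// the Set only tracks which repo names have already been added.
const addedRepos = new Set();
const tableRows = [];

function addRow(row) {
  if (addedRepos.has(row.name)) return false; // duplicate: skip
  addedRepos.add(row.name);
  tableRows.push(row);
  return true;
}

addRow({ name: 'alice/micro' });
addRow({ name: 'alice/micro' }); // duplicate from the API
addRow({ name: 'bob/micro' });
console.log(tableRows.length); // 2
```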

payne911 commented 1 year ago

Yes, although technically, given that I've already done an implementation with a list which should have solved this issue, it is clear that the problem is not what I thought it was.

There's most probably a race condition happening. Could be in the Octokit library, or straight in my code

Feel free to investigate as I do not have much free time these days.