Open pepa65 opened 1 year ago
Good catch. For posterity, here's the result CSV content of the repo this user is talking about: https://justpaste.it/8evui
Worth noting: the CSV export code is not responsible. The result table itself contains duplicates. Subset example from the unfinished scan (this particular entry actually appears 3 times in the full table):
And looking at the Network tab in Chrome DevTools, it seems that GitHub's API itself is returning duplicates:
The simplest fix to this is to keep a HashSet of all entries added to the table, and to only append new rows if the repo isn't in the HashSet. Interestingly, this filter could happen earlier, thus potentially saving a few network calls (i.e. avoiding even requesting the fork's information if it already appears in the table). If someone feels like implementing that, I'll gladly review the PR.
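A minimal sketch of that idea (not the project's actual code; `dedupeForks`, `seenRepos`, and the use of `fork.full_name` as the key are my assumptions, though `full_name` is the `owner/repo` string the GitHub API returns and is unique per repository):

```javascript
// Hypothetical sketch: filter duplicate forks returned by the GitHub API
// *before* fetching each fork's details, so duplicates never reach the
// table and we skip the extra network call entirely.
const seenRepos = new Set();

function dedupeForks(forks) {
  return forks.filter(fork => {
    if (seenRepos.has(fork.full_name)) {
      return false; // duplicate entry from the API: skip it
    }
    seenRepos.add(fork.full_name);
    return true;
  });
}
```

Keying on the `full_name` string rather than storing the fork objects themselves matters, because a Set compares objects by reference, not by content.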
I'll label this issue as a bug, despite the bug not being in our codebase.
> The simplest fix to this is to keep a HashSet of all entries added in the table, and to only append new rows if the repo isn't in the HashSet.
The even better fix is to change TABLE_DATA into a set instead of an array.
> to change TABLE_DATA into a set instead of an array.
Theoretically, a set has no ordering. Thus, the operation of sorting would make no sense against that data structure.
Also, it seems like JavaScript has no built-in set (and thus I regret saying "The simplest fix to this is to keep a HashSet"). We'd have to make one, and at that point it might become not worth it: the extra code increases complexity and reduces maintainability, while not providing much performance gains because we are talking about relatively small tables.
The simplest approach would be to simply scan the entire table array to check for a duplicate: that's an acceptable O(n).
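That linear scan could look something like this (a sketch, assuming a row array named `TABLE_DATA` as mentioned above; the `id` field and `addRowIfNew` helper are hypothetical stand-ins for whatever uniquely identifies a repository in a row):

```javascript
// Hypothetical sketch: O(n) duplicate check against the existing table rows
// before appending a new one.
const TABLE_DATA = [];

function addRowIfNew(row) {
  const isDuplicate = TABLE_DATA.some(existing => existing.id === row.id);
  if (!isDuplicate) {
    TABLE_DATA.push(row);
  }
  return !isDuplicate; // true if the row was actually appended
}
```

For tables of at most a few hundred forks, the quadratic total cost of this check is negligible compared to the network latency of the scan itself.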
Update: I have the fix ready. I'm simply waiting on #58 to get merged in as it'll build on it.
Odd. This issue is still happening, but it's not consistent.
There's most probably a race condition happening. It could be in the Octokit library, or directly in my code. I'll have to try to bisect manually.
Or maybe my JavaScript knowledge is too narrow for me to even find the source of this bug. ¯\_(ツ)_/¯
This was what I thought would fix it.
Is this the set you want?
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Set
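For reference, JavaScript does have a built-in `Set` (since ES2015), and it deduplicates primitive values directly, which is why keying on a repo's name string works but storing row objects would not:

```javascript
// Built-in Set: duplicates of primitive values are dropped automatically.
const names = ['zyedidia/micro', 'other/fork', 'zyedidia/micro'];
const unique = [...new Set(names)];

// Caveat: Set compares objects by *reference*, so two distinct row objects
// with identical contents would both be kept. Deduplicate on a string key
// (e.g. the repo's full name) instead of on the row objects themselves.
const rows = [{ name: 'a/b' }, { name: 'a/b' }];
const rowSet = new Set(rows); // still has 2 entries!
```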
Yes, although given that I already did an implementation with a list which should have solved this issue, it's clear that the problem is not what I thought it was.
> There's most probably a race-condition happening. Could be in the Octokit library, or straight in my code
Feel free to investigate as I do not have much free time these days.
Not sure if this is to be expected somehow, but I found nothing in the Issues about it: I'm getting lots of duplicate results displayed, the same lines with the same values. When exported to CSV, it is the same.
For example, the file
useful-forks-https\ __github.com_zyedidia_micro_.csv
has 104 lines (including the header), while with duplicate lines removed, it only has 59.