seart-group / ghs

GitHub Search: Platform used to crawl, store and present projects from GitHub, as well as any statistics related to them
https://seart-ghs.si.usi.ch
MIT License
124 stars 13 forks

"Exclude Forks" Additional Filter gives a result still containing some forks #166

Open wolfenmark opened 10 months ago

wolfenmark commented 10 months ago

Description: I used GHS to obtain a list of projects with at least 10 contributors, 100 stars, and 1,000 commits, and explicitly asked GHS to exclude forks using the filter checkbox in the UI. On checking the projects in the list, some of them are still forks (9 out of about 12,000).

Replication: The following two projects were in the list and can be used to reproduce the bug. If you search for them in GHS with Exclude Forks checked, they are returned, but when you click through, they clearly show as forks on GitHub (and are also forks according to the REST API):

To replicate: search for "qaprosoft" or "anurodhp" with Exclude Forks checked. Both are returned; follow the link to the repository on GitHub and note the "forked from" line at the top.
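The REST API check mentioned above can be sketched as follows. This is a minimal illustration, not how GHS itself performs the check: it assumes the public `GET /repos/{owner}/{repo}` endpoint and its boolean `fork` field, and the helper names (`repo_url`, `is_fork`, `fetch_repo`) are hypothetical. Requests are unauthenticated and therefore rate-limited.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def repo_url(full_name: str) -> str:
    """Build the REST API URL for a repository given as 'owner/name'."""
    owner, name = full_name.split("/", 1)
    return f"{API_ROOT}/repos/{owner}/{name}"


def is_fork(payload: dict) -> bool:
    """Return True if a repository payload is flagged as a fork."""
    return bool(payload.get("fork", False))


def fetch_repo(full_name: str) -> dict:
    """Fetch repository metadata from GitHub (performs a network call)."""
    request = urllib.request.Request(
        repo_url(full_name),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `is_fork(fetch_repo("qaprosoft/carina"))` would then report the fork flag that GitHub currently returns for that repository.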

I included two examples because anurodhp/VaxProj was renamed from anurodhp/monal, whereas qaprosoft/carina was never renamed. Both projects had their last commit in 2021 (March and November, respectively).

Other Info: Other projects that I did not check manually but that my analysis marked as forks (in case you need more cases to investigate):

1) Casing is not relevant for renames. The names are reported as they come out of my tool, but GitHub is case-insensitive for repository names. The only rename example is anurodhp/monal -> anurodhp/VaxProj.

2) The list might be outdated, since it comes from old data that I had already analyzed, but the two examples I checked manually definitely still exhibit the problem.

wolfenmark commented 10 months ago

The two example repositories also show stats in GHS that are different from the ones in the actual repo on GitHub.

Examples:

dabico commented 10 months ago

Given that the former fork is still reachable if you look for it by its old name, this leads me to believe that the project started under the old name as a non-fork, was deleted by the owner, and then re-created as a fork under the same name, before finally being renamed. The deletion would explain the drop in stars. However, this is all guesswork on my part, and it might even be better to reach out to @anurodhp directly for a timeline of events. A clear understanding of the project's lifecycle will help us avoid instances of this in the future.

The latter project was in all likelihood deleted and then re-created as a fork. Given that it never went over 10 stars and was not updated for a long time, we never had the chance to update its information. This does open a new can of worms: what should we do with projects that fall below the star threshold after they have been mined? I think the best course of action would be to devise a new "maintenance" job that periodically checks repositories that have not been updated in a very long time, refreshing stale information and removing the repository if it no longer satisfies the star criterion.
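A rough sketch of what such a "maintenance" pass could look like. Everything here is an assumption made for illustration rather than GHS code: the `StoredRepo` record, the 180-day staleness window, and the `MIN_STARS` value are hypothetical, while `full_name`, `stargazers_count`, and `fork` are fields of the GitHub REST API repository payload.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_STARS = 10                     # assumed star threshold for mined projects
STALE_AFTER = timedelta(days=180)  # assumed meaning of "a very long time"


@dataclass
class StoredRepo:
    name: str
    stars: int
    is_fork: bool
    last_checked: datetime


def is_stale(repo: StoredRepo, now: datetime) -> bool:
    """A record is stale if it was not refreshed within the window."""
    return now - repo.last_checked >= STALE_AFTER


def refresh(repo: StoredRepo, payload: Optional[dict], now: datetime) -> Optional[StoredRepo]:
    """Return the refreshed record, or None if it should be removed.

    `payload` is the current API response for the repository,
    or None if the repository no longer exists upstream.
    """
    if payload is None:
        return None
    stars = payload["stargazers_count"]
    if stars < MIN_STARS:
        return None  # no longer satisfies the star criterion
    return StoredRepo(
        name=payload["full_name"],  # picks up renames and casing changes
        stars=stars,
        is_fork=payload["fork"],    # picks up a re-creation as a fork
        last_checked=now,
    )
```

The job would select the records for which `is_stale` holds, fetch their current payloads, and apply `refresh` to each, deleting those for which it returns `None`.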

dabico commented 10 months ago

By the way, regarding the naming mismatches: GitHub does not distinguish letter case in repository names. What I mean by this is that, given a repository ghs owned by @seart-group, the same user would not be able to create Ghs or GHS.

Example:

All of these API links point to the same repository. My point is that there is no difference between the actual name and its lower-case variant. We used to store all names in lowercase (due to a misunderstanding by one of the maintainers), but I have since changed this so that names are stored as displayed on GitHub. As a result, some repositories that have not been updated in a long time may still show a case mismatch between the stored name and the actual name. I expect this will also be rectified by the proposed "maintenance" job.
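For anyone comparing stored names against GitHub names in their own tooling, a case-insensitive comparison avoids false mismatches. A minimal sketch, with hypothetical helper names:

```python
def same_repository(stored_name: str, actual_name: str) -> bool:
    """Compare two 'owner/name' strings while ignoring letter case."""
    return stored_name.lower() == actual_name.lower()


def needs_case_fix(stored_name: str, actual_name: str) -> bool:
    """True when both names match except for their casing."""
    return same_repository(stored_name, actual_name) and stored_name != actual_name
```

With this, `seart-group/ghs` and `seart-group/GHS` compare as the same repository, while a genuine rename such as anurodhp/monal -> anurodhp/VaxProj does not.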

wolfenmark commented 10 months ago

No issue with the names; GitHub is indeed case-insensitive for repository names. I reported them as they come out of my tool (I did not mean to imply a casing difference) so that we could check whether renames are possibly linked to the problem. The only rename example seems to be anurodhp/monal -> anurodhp/VaxProj. I have added a note to the issue to clarify this.