Open k----n opened 4 years ago
The commit also still exists on github: https://github.com/liferay/liferay-portal/commit/4ec29d1bdde625673f844e2a44cc7d9095253b35
4ec29d1bdde625673f844e2a44cc7d9095253b35 is a regular commit (not among the bad commits listed in woc.pm) but is indeed missing. The repo is updated regularly (and was updated for version S), but that specific commit was lost in the process, so hopefully it will be extracted successfully during the next collection.
@audrism I've also come across commits that exist but have no project associated with them.
Are you interested in the list of commits?
Getting a list is easy. For inPonly:
join -v1 <(zcat c2PFullS0.s | cut -d\; -f1 | uniq) <(zcat c2datFullS0.s | cut -d\; -f1)
For inConly:
join -v2 <(zcat c2PFullS0.s | cut -d\; -f1 | uniq) <(zcat c2datFullS0.s | cut -d\; -f1)
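To make the two directions concrete, here is a minimal sketch on made-up data. The file names, the short hashes, and the trailing fields are hypothetical stand-ins for the real sorted, gzipped maps; it only assumes that both inputs are sorted on the commit field, which join requires.

```sh
# Toy stand-ins for the real maps (hypothetical names, hashes and fields).
tmp=$(mktemp -d)
printf 'aaa1;proj_x\nbbb2;proj_x\nbbb2;proj_y\n' > "$tmp/c2P.sample"   # commit;project
printf 'bbb2;1600000000\nccc3;1600000001\n' > "$tmp/c2dat.sample"      # commit;placeholder

# Reduce both files to their commit column; uniq drops repeated commits.
cut -d';' -f1 "$tmp/c2P.sample" | uniq > "$tmp/p.keys"
cut -d';' -f1 "$tmp/c2dat.sample" > "$tmp/c.keys"

# inPonly: a project points at the commit, but no commit data was collected.
in_p_only=$(join -v1 "$tmp/p.keys" "$tmp/c.keys")
# inConly: commit data exists, but no project is recorded for it.
in_c_only=$(join -v2 "$tmp/p.keys" "$tmp/c.keys")

echo "inPonly: $in_p_only"
echo "inConly: $in_c_only"
rm -rf "$tmp"
```

On this toy input, aaa1 ends up in inPonly and ccc3 in inConly; bbb2 is in both files and is dropped from both lists.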
What would be helpful is a script or audit process that tries to recover missing commits for inPonly and recovers projects for orphaned commits in inConly.
While the first is straightforward if the git repo is still online and has not been compacted, the second is trickier:
use ghtorrent/SwHeritage?
ghtorrent and SwHeritage might not cover the most recent commits.
There is a way to search for it on github... but API limits: https://github.com/search?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Commits
Note that the CI bot has already deleted the branch, but the commit still shows up in a PR:
https://github.com/liferay/liferay-portal/pull/3498/commits/4ec29d1bdde625673f844e2a44cc7d9095253b35
I guess you could also query https://github.com/<project/user name>/<repo>/commit/<sha1>
to see if the commit still exists before exhausting API limits.
e.g. https://github.com/liferay/liferay-portal/commit/4ec29d1bdde625673f844e2a44cc7d9095253b35
But that page carries no metadata on whether the commit actually belongs to the repo, unlike the link you get from search: https://github.com/liferay/liferay-portal/pull/3498/commits/4ec29d1bdde625673f844e2a44cc7d9095253b35
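A minimal sketch of that pre-check, assuming curl is available. The function names are made up for illustration, and the status codes in the comment are what I would expect rather than documented behavior.

```sh
# Build the commit-page URL for an owner/repo slug and a SHA-1.
commit_url() {
  printf 'https://github.com/%s/commit/%s\n' "$1" "$2"
}

# Print only the HTTP status of the commit page. Assumption: 200 if GitHub
# still serves the commit, 404 if not; other codes are possible if the
# site throttles the request. Needs network access.
commit_status() {
  curl -s -o /dev/null -w '%{http_code}' "$(commit_url "$1" "$2")"
}

# Example (not run here):
#   commit_status liferay/liferay-portal 4ec29d1bdde625673f844e2a44cc7d9095253b35
```

A 200 here only says the object is still served under that repo path; it does not say whether the commit is reachable from a branch or only from a PR.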
SwHeritage returns no hits: https://archive.softwareheritage.org/browse/search/?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&with_visit=true&with_content=true&search_metadata=true
So the search is affected by API limits? The URL does not appear to invoke the REST/GraphQL API.
Search has a limit of 30 requests/min with a token (https://docs.github.com/en/free-pro-team@latest/rest/reference/search#rate-limit).
You can also query for when your rate limit expires: https://docs.github.com/en/free-pro-team@latest/rest/reference/rate-limit
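A rough sketch of how a script could pace itself, using the x-ratelimit-remaining and x-ratelimit-reset response headers that the API sends back. The function name wait_seconds and the GITHUB_TOKEN variable are my own placeholders.

```sh
# Read HTTP response headers on stdin and print how many seconds to wait
# before the next request: 0 while requests remain (or the reset time has
# passed), otherwise the time left until x-ratelimit-reset (epoch seconds).
# $1 is the current epoch time, passed in so the function is easy to test.
wait_seconds() {
  tr -d '\r' | awk -v now="$1" -F': *' '
    tolower($1) == "x-ratelimit-remaining" { remaining = $2 }
    tolower($1) == "x-ratelimit-reset"     { reset = $2 }
    END {
      if (remaining > 0 || reset <= now) print 0
      else print reset - now
    }'
}

# Example (not run here; -D - dumps the response headers to stdout):
#   curl -s -D - -o /dev/null -H "Authorization: token $GITHUB_TOKEN" \
#     "https://api.github.com/search/issues?q=<sha1>" | wait_seconds "$(date +%s)"
```

Sleeping for the printed number of seconds before the next search call should keep a batch job under the limit without hard-coding 30 requests/min.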
I imagine the lookup to be 2 steps:
9 non-api endpoints are queried for counts
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Users
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Wikis
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Topics
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Marketplace
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=RegistryPackages
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Discussions
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Code
https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Repositories
where https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues returns 2.
Based on the counts you can then use the official API:
curl \
  -H "Accept: application/vnd.github.v3+json" \
  "https://api.github.com/search/issues?q=4ec29d1bdde625673f844e2a44cc7d9095253b35"
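The two steps could be scripted roughly as follows. This only builds the URLs; the count endpoints are undocumented, so treat their shape as an observation from the list above, not a stable interface. The function names are hypothetical, and the type-to-endpoint mapping is my reading of the REST search reference, which has no search endpoint for Wikis, Marketplace, RegistryPackages or Discussions as far as I can tell.

```sh
sha=4ec29d1bdde625673f844e2a44cc7d9095253b35
types="Users Wikis Topics Marketplace RegistryPackages Discussions Issues Code Repositories"

# Step 1: count URL for one search type (undocumented, non-API endpoint).
count_url() {
  printf 'https://github.com/search/count?q=%s&type=%s\n' "$1" "$2"
}

# Step 2: REST search URL for a type, or a non-zero exit status if no
# matching REST search endpoint is known (assumed mapping).
api_url() {
  case "$2" in
    Issues)       path=issues ;;
    Code)         path=code ;;
    Repositories) path=repositories ;;
    Users)        path=users ;;
    Topics)       path=topics ;;
    *)            return 1 ;;
  esac
  printf 'https://api.github.com/search/%s?q=%s\n' "$path" "$1"
}

# Print each count URL and, where one exists, the follow-up API URL.
for t in $types; do
  count_url "$sha" "$t"
  api_url "$sha" "$t" || echo "(no REST search endpoint assumed for $t)"
done
```

In a real run you would fetch each count URL first and only call the API URL for types whose count is non-zero, so the 30 requests/min budget is spent on types that have hits.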
The 30 requests/min is limiting, and the non-API endpoints are also rate limited (although I'm unsure what the limit is exactly).
Your mileage may also vary in getting useful results (the example probably works because the commit SHA was included somewhere in the pull request body), e.g.
"body": "Merging the following commit: [2f586e07928e14a424edfbf3b547a3881ca193f9](https://github.com/liferay/com-liferay-poshi-runner/commit/2f586e07928e14a424edfbf3b547a3881ca193f9)"
It seems like git clone --mirror <repo> also retrieves more commits.
I use --mirror when cloning as it gets all the branches.
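A small sketch for checking whether a given commit actually landed in a mirror clone. has_commit is a made-up helper name; it relies on git cat-file -e, which exits non-zero when the object cannot be resolved.

```sh
# Succeeds if <sha> names a commit object present in the repository at <dir>.
# Works for bare/mirror clones as well as normal working copies.
has_commit() {
  git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Example (needs network; the target directory name is arbitrary):
#   git clone --mirror https://github.com/liferay/liferay-portal liferay-portal.git
#   has_commit liferay-portal.git 4ec29d1bdde625673f844e2a44cc7d9095253b35 \
#     && echo present || echo missing
```

My guess as to why --mirror finds more: it copies every ref the server advertises, which on GitHub includes the refs/pull/* refs, so commits that only survive in a PR can still come along.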
For example, I run
echo "liferay_liferay-portal" | ~/lookup/getValues -f p2c | grep 4ec29d1bdde625673f844e2a44cc7d9095253b35
which means that commit 4ec29d1bdde625673f844e2a44cc7d9095253b35 should exist. This is what happens when I run the following: