ssc-oscar / lookup

A mirror of bitbucket.org/swcs/lookup

Commits SHAs from p2c not indexed? #9

Open k----n opened 3 years ago

k----n commented 3 years ago

For example, I run echo "liferay_liferay-portal" | ~/lookup/getValues -f p2c | grep 4ec29d1bdde625673f844e2a44cc7d9095253b35, which means that the commit 4ec29d1bdde625673f844e2a44cc7d9095253b35 should exist.

This is what happens when I run the following:

> echo "4ec29d1bdde625673f844e2a44cc7d9095253b35" | ~/lookup/getValues c2ta
no 4ec29d1bdde625673f844e2a44cc7d9095253b35 in /data/basemaps/c2taFullS

> echo "4ec29d1bdde625673f844e2a44cc7d9095253b35"  | ~/lookup/showCnt commit
no commit 4ec29d1bdde625673f844e2a44cc7d9095253b35 in 78

k----n commented 3 years ago

The commit also still exists on github: https://github.com/liferay/liferay-portal/commit/4ec29d1bdde625673f844e2a44cc7d9095253b35

audrism commented 3 years ago

4ec29d1bdde625673f844e2a44cc7d9095253b35 is a regular commit (not among the bad commits listed in woc.pm) but is, indeed, missing. The repo is updated regularly (and was updated for version S), but that specific commit was lost in the process, so hopefully it will be extracted successfully during the next collection.

k----n commented 3 years ago

@audrism I've also come across commits that exist but have no project associated with them.

Are you interested in the list of commits?

audrism commented 3 years ago

Getting a list is easy: for inPonly:

join -v1 <(zcat c2PFullS0.s|uniq) <(zcat c2datFullS0.s| cut -d\; -f1)

for inConly:

join -v2 <(zcat c2PFullS0.s|uniq) <(zcat c2datFullS0.s| cut -d\; -f1)

What would be helpful is a script or audit process that tries to recover missing commits for inPonly and recovers projects for orphaned commits in inConly.

While the first is straightforward if the git repo is still online and has not been compacted, the second is trickier:

use ghtorrent/SwHeritage?
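
Both lists are set differences over commit SHAs. As a minimal sketch of that idea on toy inputs, assuming plain-text files with one SHA per line (the helper name `only_in_first` and the file names in the usage comment are hypothetical; the real basemaps are gzipped and semicolon-delimited, so they would need zcat and cut first):

```shell
# Print the lines that occur in the first file but not in the second.
# Each input holds one commit SHA per line; input order does not matter.
only_in_first() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    # comm requires both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    # -2 drops lines unique to the second file, -3 drops common lines.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Usage (hypothetical file names):
#   only_in_first project_commits.txt commit_data.txt   # inPonly
#   only_in_first commit_data.txt project_commits.txt   # inConly
```

Swapping the two arguments gives the opposite list, which is the same effect as switching between -v1 and -v2 in join.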

k----n commented 3 years ago

ghtorrent and SwHeritage might not cover the most recent commits.

There is a way to search for it on github... but API limits: https://github.com/search?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Commits

Note that the CI bot has already deleted the branch, but the commit still shows up in a PR:

https://github.com/liferay/liferay-portal/pull/3498/commits/4ec29d1bdde625673f844e2a44cc7d9095253b35

I guess you could also query https://github.com/<project/user name>/<repo>/commit/<sha1> to see if the commit still exists before exhausting API limits.
e.g. https://github.com/liferay/liferay-portal/commit/4ec29d1bdde625673f844e2a44cc7d9095253b35
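
A rough sketch of that pre-check, assuming curl is installed and that an HTTP 200 on the commit page is a good enough signal (the function names `commit_url` and `commit_page_exists` are made up for illustration, and the page may itself be rate limited):

```shell
# Build the commit page URL from owner, repo and SHA.
commit_url() {
    printf 'https://github.com/%s/%s/commit/%s\n' "$1" "$2" "$3"
}

# Succeed if the commit page answers with HTTP 200.
# Needs curl and network access; -L follows redirects,
# and %{http_code} reports the status of the final response.
commit_page_exists() {
    status=$(curl -s -L -o /dev/null -w '%{http_code}' "$(commit_url "$1" "$2" "$3")")
    [ "$status" = "200" ]
}

# Usage:
#   commit_page_exists liferay liferay-portal 4ec29d1bdde625673f844e2a44cc7d9095253b35 && echo "page exists"
```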

But that page doesn't carry the metadata for whether or not the commit belongs to the repo, unlike the link you get from search: https://github.com/liferay/liferay-portal/pull/3498/commits/4ec29d1bdde625673f844e2a44cc7d9095253b35


SwHeritage returns no hits: https://archive.softwareheritage.org/browse/search/?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&with_visit=true&with_content=true&search_metadata=true

audrism commented 3 years ago

So the search is affected by API limits? The URL does not appear to invoke the REST/GraphQL API.

k----n commented 3 years ago

Search has a limit of 30 requests/min with a token (https://docs.github.com/en/free-pro-team@latest/rest/reference/search#rate-limit).

You can also query for when your rate limit expires: https://docs.github.com/en/free-pro-team@latest/rest/reference/rate-limit

I imagine the lookup to be 2 steps:

  1. 9 non-api endpoints are queried for counts

    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Users
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Wikis
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Topics
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Marketplace
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=RegistryPackages
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Discussions
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Code
    https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Repositories

    Where https://github.com/search/count?q=4ec29d1bdde625673f844e2a44cc7d9095253b35&type=Issues returns 2.

  2. Based on the counts you can then use the official API:

    curl \
      -H "Accept: application/vnd.github.v3+json" \
      "https://api.github.com/search/issues?q=4ec29d1bdde625673f844e2a44cc7d9095253b35"

The 30 requests/min is limiting, and the non-API endpoints are also rate limited (although I'm unsure what that limit is exactly).
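
To stay under the 30 requests/min search limit when checking many SHAs, the calls could be spaced out. A minimal sketch, assuming SHAs arrive one per line on stdin and that a token is available in an environment variable named `GITHUB_TOKEN` (the variable and all function names here are assumptions, not part of lookup):

```shell
# Seconds to wait between requests for a given per-minute limit.
# Rounds up so the spacing never goes below 60/limit.
min_interval() {
    echo $(( (60 + $1 - 1) / $1 ))
}

# Build the issue-search URL from step 2 for one SHA.
search_issues_url() {
    printf 'https://api.github.com/search/issues?q=%s\n' "$1"
}

# Read SHAs from stdin and query them one at a time, pausing between calls.
# Needs curl, network access and a token in GITHUB_TOKEN (assumed name).
search_all() {
    delay=$(min_interval 30)
    while IFS= read -r sha; do
        curl -s -H "Accept: application/vnd.github.v3+json" \
             -H "Authorization: token $GITHUB_TOKEN" \
             "$(search_issues_url "$sha")"
        sleep "$delay"
    done
}

# Usage:
#   search_all < missing_shas.txt
```

Fixed spacing is the simplest option; reading the remaining quota from the rate-limit endpoint mentioned above would allow bursts but needs response parsing.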

Your mileage may also vary in getting useful results (the example probably works because the commit SHA was included somewhere in the pull request body), e.g.

"body": "Merging the following commit: [2f586e07928e14a424edfbf3b547a3881ca193f9](https://github.com/liferay/com-liferay-poshi-runner/commit/2f586e07928e14a424edfbf3b547a3881ca193f9)"

k----n commented 2 years ago

It seems like git clone --mirror <repo> also retrieves more commits

audrism commented 2 years ago

I use --mirror when cloning as it gets all the branches.
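
One way to check whether a mirror clone actually contains a given commit is `git cat-file -e`. A small sketch, assuming git is installed (the helper name `has_commit` and the directory name `mirror.git` are arbitrary):

```shell
# Succeed if the SHA in $2 names a commit object in the repository at $1.
# The ^{commit} suffix makes git reject objects of other types.
has_commit() {
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Usage (needs network access):
#   git clone --mirror https://github.com/liferay/liferay-portal mirror.git
#   has_commit mirror.git 4ec29d1bdde625673f844e2a44cc7d9095253b35 && echo found
```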