ossf / scorecard

OpenSSF Scorecard - Security health metrics for Open Source
https://scorecard.dev
Apache License 2.0
4.46k stars 489 forks source link

BUG: Regression in LICENSE check, outputs 22,000 lines for details #2522

Closed pnacht closed 1 year ago

pnacht commented 1 year ago

Describe the bug The Gradle repo's JSON has 22,000 lines describing the License check.

"details": [
    "Info: License file found in expected location: LICENSE:1",
    "...",
    "Info: License file found in expected location: COPYING.MIT:1",
    "Info: FSF or OSI recognized license: COPYING.MIT:1",
    "...",
    "Info: License file found in expected location: LICENSE.md:1",
    "Info: FSF or OSI recognized license: LICENSE.md:1",
    "...",
    "Info: License file found in expected location: LICENSE.txt:1",
    "Info: FSF or OSI recognized license: LICENSE.txt:1",
    "...",
    "Warn: Any licence detected not an FSF or OSI recognized license: LICENSE:1",
    "...",
    "Info: License file found in expected location: MIT-LICENSE:1",
    "Info: FSF or OSI recognized license: MIT-LICENSE:1",
    "...",
    "Info: License file found in expected location: UNLICENSE:1",
    "Warn: Any licence detected not an FSF or OSI recognized license: UNLICENSE:1",
    "...",
]

where each "..." is dozens of repetitions of these results (mostly the basic "Info: ... LICENSE:1").

Also worth mentioning the only file that actually exists in the Gradle repo is LICENSE on the root.

I also checked cpython's json (which I'd seen two weeks ago and was perfectly fine), and it also has the same issue.

Reproduction steps

Both runs are dated "2022-11-28".

Expected behavior A single "Info: License file found in expected location: LICENSE:1" line in the details block (maybe a few other items?).

azeemshaikh38 commented 1 year ago

This is terrible! Can we please revert the change which caused this and make a minor release? @laurentsimon I believe you worked on this PR with a contributor?

laurentsimon commented 1 year ago

I'm unable to reproduce the problem locally. When running at head:

$ go run . --repo=gradle/gradle --format json --checks License --show-details | jq
...
"checks": [
    {
      "details": [
        "Info: License file found in expected location: LICENSE:1",
        "Info: FSF or OSI recognized license: LICENSE:1"
      ],
      "score": 10,
      "reason": "license file detected",
      "name": "License",
go run . --repo=python/cpython --format json --checks License --show-details | jq
{
      "details": [
        "Info: License file found in expected location: LICENSE:1",
        "Warn: Any licence detected not an FSF or OSI recognized license: LICENSE:1"
      ],

I'll chat with @pnacht offline

laurentsimon commented 1 year ago

/cc @shissam who worked on the PR. If you have ideas of what may be happening, let us know.

pnacht commented 1 year ago

I've also run this locally and the output is similar to what @laurentsimon shows above for both projects.

Not sure if this was a transitory bug or if there's a difference between what's happening in the CLI and the job. A good check will happen naturally in a few days when the projects are rerun by the job (the error happened in the 2022-11-28 run).

scorecard --repo=python/cpython --checks=License --show-details --format=json | jq
{
  "date": "2022-12-07",
  "repo": {
    "name": "github.com/python/cpython",
    "commit": "f3e97c90ed6f82fce67b0e8757eec54908ba49ce"
  },
  "scorecard": {
    "version": "4.8.0",
    "commit": "c40859202d739b31fd060ac5b30d17326cd74275"
  },
  "score": 10,
  "checks": [
    {
      "details": [
        "Info: : Doc/copyright.rst:1"
      ],
      "score": 10,
      "reason": "license file detected",
      "name": "License",
      "documentation": {
        "url": "https://github.com/ossf/scorecard/blob/c40859202d739b31fd060ac5b30d17326cd74275/docs/checks.md#license",
        "short": "Determines if the project has defined a license."
      }
    }
  ],
  "metadata": null
}
scorecard --repo=gradle/gradle --checks=License --show-details --format=json | jq
{
  "date": "2022-12-07",
  "repo": {
    "name": "github.com/gradle/gradle",
    "commit": "9716ea3d100eb07b60f44ad060b8d5f2ba192fd6"
  },
  "scorecard": {
    "version": "4.8.0",
    "commit": "c40859202d739b31fd060ac5b30d17326cd74275"
  },
  "score": 10,
  "checks": [
    {
      "details": [
        "Info: : LICENSE:1"
      ],
      "score": 10,
      "reason": "license file detected",
      "name": "License",
      "documentation": {
        "url": "https://github.com/ossf/scorecard/blob/c40859202d739b31fd060ac5b30d17326cd74275/docs/checks.md#license",
        "short": "Determines if the project has defined a license."
      }
    }
  ],
  "metadata": null
}
azeemshaikh38 commented 1 year ago

Last week's cron job output was 750GB instead of 6GB, so something was definitely off.

laurentsimon commented 1 year ago

@spencerschrock any ideas?

laurentsimon commented 1 year ago

Since we're pausing the cron job, can we use a test bucket and move these entries to the test buckets? This way they won't be accessible publicly, and we can investigate what is happening in the cron in the meantime. Wdut?

shissam commented 1 year ago

first: lines 89:91 of checks/evaluation/license.go is a loop over r.LicenseFile which for all intense and purpose should only have one result in the range. How checks/raw/license.go would construct such a large result set over which to iterate and cause details to produce 22,000’ish lines repeating the same information to checker.LogMessage() is a bit of a mystery to me at the moment. (BTW, upon closer inspection it is NOT repeating the same information -- see below)

next: checks/raw/license.go make s single, non-loop/non-iterating call to GH License API c.RepoClient.ListLicenses() at line 127 in that same file. On return from that call - c.RepoClient.ListLicenses() could return a range of licenseFound upon which to return for evaluation (there is an append in the iteration aafter the call to c.RepoClient.ListLicenses()— but the actual code for c.RepoClient.ListLicenses() is not in a loop and only makes one GH API invocation (see lines 47:90 of clients/githubrepo/licenses.go). That one call from c.RepoClient.ListLicenses() should only result in one (1) result set upon which the caller ( checks/raw/license.go) would iterate over, and hence, would only allow for one results set to be interpreted by checks/evaluation/license.go.

Inspection of that function (again lines 47:90 of clients/githubrepo/licenses.go) is in an immediate func() handler for handler.once.Do and I cannot imagine why this would cause the code to call the GH API (on the order of) 10,000 times (i.e., 2 checker.LogMessage()’s for each invocation to the GH APIinvocation).

finally: with all that said, when looking at the 22,000 lines of results from the scorecard REST API reported, it should be noted that 63 uniq files names are being reported as being “found”, for example: LICENSE, copyright.c, OFL.txt, LICENSE.mkd just to name a few.

Focusing specifically on LICENSE.mkd I can say this for certain:

filename LICENSE.mkd is a clue which I think warrants further thought: 1) I don’t see how it can some from scorecard source code (like a test condition or test filename as seen in checks/raw/license_test.go (not that I could see how that could leak into production code)) 2) the file is simply not in the gradle/gradle source tree (and subprojects that I could find)

So… then… where is it ( LICENSE.mkd ) coming from.

The last clue is the output from the 22,000 lines from the REST API, when I looked at it tonight, I notice not all the results were actually the same (besides the 63 unique filenames). For instance, one pairing of LogMessages report

        "Info: License file found in expected location: LICENSE:1”,
        "Info: FSF or OSI recognized license: LICENSE:1",

whilst another pairing reports

        "Info: License file found in expected location: LICENSE:1",
        "Warn: Any licence detected not an FSF or OSI recognized license: LICENSE:1",

Same filename (LICENSE) but different results from checks/raw/license.go. In the first pairing the file found had to have an SpdxID which was found to be a FSF or OSI license, hence the Info, where the other pairing having the same name had to have had an SpdxID which did not meet that same condition (a bool in checks/evaluation/license.go), hence the Warn

These two clues (same filename but different licenses recognized, one FSF/OSI and the other “not") and the other clue(s) where filenames detected actually do not appear on gradle/gradle seems to hint to me that results from OTHER repos are getting intermingled — somehow — into these results. I simply cannot explain it at this time with the information I have in front of me and the inability to reproduce the problem from the CLI. My head is moving in the direction of parallel threads intermingling somehow - I just don't know how. I followed the pattern for ListLanguages which also uses the GH API. I could certainly be wrong, but it is one path of investigation that I believe (but do not know) which is different from the CLI and what I could imagine (possible) occurring in the cron environment (that is the parallel intermingling).

shissam commented 1 year ago

BTW, I forgot to check, but for the REST API for cpython generated 77 unique filenames for the license check and like I noted earlier, gradle had 63 unique filenames. Combined those two data sets (interesting that 22,000'ish comes up again for both) shows 107 unique filenames - the more common names shared between those two data sets (62 to be precise) and 45 being unique to either the cypthon or gradle data sets (those names are more obscure). Two of the obscure file names I want to call attention to are:

As for the first one, that is very unique--where did that come from? That second one is interesting to me as that name is one that the old license check search algorithm would get a false positive when trying to locate a license file. The new license check which uses the GH API I can't imaging how it would find that, and the new builtin heuristic would simply reject .json outright - I just tested that name on my CLI code here to make sure I was not crazy. So how that one specifically (like LICENSE.mkd (another invalid filename extension)) should never even got into the mix. And to note, the new builtin heuristic would only kick in if the call to c.RepoClient.ListLicenses() returned clients.ErrUnsupportedFeature which should only occur for GitLab repo clients.. and certainly NOT GitHub repo clients.

shissam commented 1 year ago

So - I have been thinking about this all night and I have a question... I'd like to dig into this more, and noticed that e2e/license_test.go is not a parallel test, like the unit-tests are (for example for TestLicense in checks/raw/license_test.go) - my suspicion is cron is parallel and that is a contributing problem, If I could parallelize an test like e2e/license_test.go I bit I can reproduce the problem. I'll continue to dig, but if anyone has a idea how best to proceed - that might speed up my analysis.

laurentsimon commented 1 year ago

Sounds like a good idea. Please let us know what you find. Thank you so much for digging into this @shissam !

shissam commented 1 year ago

@laurentsimon I could (possibly) go faster if there was someone I could chat with more about how the cron environment parallelizes its work. I can discover that myself at the expense of time.

laurentsimon commented 1 year ago

@azeemshaikh38 @spencerschrock own most of the cron job code. Could you point @shissam in the right direction?

azeemshaikh38 commented 1 year ago

@shissam the cron job analyses repos sequentially. The clients passed into RunScorecard are persisted to avoid re-initializing them repeatedly. Maybe this is the source of your bug? You are possibly saving state in one of these clients that is getting persisted across repos?

This is where the Scorecard cron call this fn - https://github.com/ossf/scorecard/blob/main/cron/internal/worker/main.go#L167

shissam commented 1 year ago

@azeemshaikh38 OMG (Goodness) - I think you've clarified that for me! Look at this call:

https://github.com/ossf/scorecard/blob/main/clients/githubrepo/licenses.go#L80

With what you said, is what I am now thinking occurring here - that is handler.licenses is persisting in the client handler as the repo changes as cron walks through all the sites?

I think that fix is to change the code there from:

handler.licenses = append(handler.licenses, clients.License{
    Key:    bodyJSON.GetLicense().GetKey(),
    Name:   bodyJSON.GetLicense().GetName(),
    SPDXId: bodyJSON.GetLicense().GetSPDXID(),
    Path:   bodyJSON.GetName(),
    Type:   bodyJSON.GetType(),
    Size:   bodyJSON.GetSize(),
},
)

to

handler.licenses = clients.License{
    Key:    bodyJSON.GetLicense().GetKey(),
    Name:   bodyJSON.GetLicense().GetName(),
    SPDXId: bodyJSON.GetLicense().GetSPDXID(),
    Path:   bodyJSON.GetName(),
    Type:   bodyJSON.GetType(),
    Size:   bodyJSON.GetSize(),
}

The question, from a Scorecard perspective, is how should this client API "clean up" afterwards? Like I stated earlier, I followed the ListLicenses pattern -- so why is the check that incorporates ListLicenses not sharing in the same problem.

I think we're on the path to solving this.

shissam commented 1 year ago

Wait a minute... looking back at worker/main the "main" loop shows

    for _, repoReq := range batchRequest.GetRepos() {
        logger.Info(fmt.Sprintf("Running Scorecard for repo: %s", *repoReq.Url))
        repo, err := githubrepo.MakeGithubRepo(*repoReq.Url)

is githubrepo.MakeGithubRepo different, then, from githubrepo.CreateGithubRepoClient and the handler state (i.e., handler.licenses) remains unchanged as a result of githubrepo.MakeGithubRepo as opposed to githubrepo.CreateGithubRepoClient?

shissam commented 1 year ago

Here is the actual change I made to clients/githubrepo/licenses.go and unit-test and e2e-gh-token tests pass as expected.

@azeemshaikh38 @laurentsimon @spencerschrock As you can see in the code, the state is restored to its initial state, before the response from the GH API is processed--this appears to me at the moment to have been the problem. It explains much of the behavior.

Do y'all have a test environment which can run/simulate the cron run?

diff --git a/clients/githubrepo/licenses.go b/clients/githubrepo/licenses.go
index df812f3..747567a 100644
--- a/clients/githubrepo/licenses.go
+++ b/clients/githubrepo/licenses.go
@@ -74,6 +74,9 @@ func (handler *licensesHandler) setup() error {
                        return
                }

+               // cron does not rebuild/reinitialize a client, this is state
+               // than needs to be reinitialized
+               handler.licenses = []clients.License{}
                // TODO: github.RepositoryLicense{} only supports one license per repo
                //       should that change to an array of licenses, the change would
                //       be here to iterate over any such range.