sw360 / capycli

CaPyCLI - Python scripts for software license compliance automation with SW360

GitHub tag matching #103

Open 16Martin opened 1 week ago

16Martin commented 1 week ago

This PR addresses https://github.com/sw360/capycli/issues/99 and introduces code intended to replace the current combination of get_github_info() followed by get_matching_tag() which exists only in capycli.bom.findsources.

This approach first tries to match a tag using the original get_matching_tag(). If all the guessing does not yield any results, the algo implicitly falls back to analyzing each tag with get_matching_tag().

16Martin commented 1 week ago

I cannot make the unittest work.

In test_find_golang_url_github() (https://github.com/sw360/capycli/blob/main/tests/test_find_sources.py#L329-L339) we expect the result of find_golang_url() to be 'https://pkg.go.dev/github.com/opencontainers/runc'. I do not understand this.

find_golang_url() has essentially two ways to set source_url, the variable that it ultimately returns:

  1. if len(split_version) == 3, in this case the source_url would contain /archive/
  2. in any other case source_url is set to the result of get_matching_tag(), which is mocked to 'https://github.com/opencontainers/runc/archive/refs/tags/v1.0.1.zip', which also contains /archive/

In addition, every non-empty result of get_matching_tag() ends with .zip.

How and why does this test work?

As a user, I would not feel good about getting https://pkg.go.dev/github.com/opencontainers/runc as a source URL, but maybe the variable names are misleading?

16Martin commented 5 days ago

Found it. The order in which the unittest sets up its mocks does not match the mock naming.
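
For anyone who has not run into this before, here is a minimal sketch of how such a mismatch typically arises with stacked patch decorators. The Finder class and the return values are made up for illustration; this is not the actual test from tests/test_find_sources.py, which may set up its mocks differently.

```python
from unittest.mock import patch


class Finder:
    """Hypothetical stand-in for illustration; not the real FindSources class."""

    def get_github_info(self):
        return "real info"

    def get_matching_tag(self):
        return "real tag"


# Decorators are applied bottom-up, so the LOWEST @patch supplies the FIRST
# mock argument of the decorated function.
@patch.object(Finder, "get_github_info")
@patch.object(Finder, "get_matching_tag")
def run_with_correct_names(mock_get_matching_tag, mock_get_github_info):
    mock_get_matching_tag.return_value = "mocked tag"
    mock_get_github_info.return_value = "mocked info"
    finder = Finder()
    return finder.get_github_info(), finder.get_matching_tag()


# Same decorators, but the parameters are named in top-down order.
# The first parameter is really the mock for get_matching_tag, so the
# configured return values end up on the wrong methods.
@patch.object(Finder, "get_github_info")
@patch.object(Finder, "get_matching_tag")
def run_with_swapped_names(mock_get_github_info, mock_get_matching_tag):
    mock_get_github_info.return_value = "mocked info"
    mock_get_matching_tag.return_value = "mocked tag"
    finder = Finder()
    return finder.get_github_info(), finder.get_matching_tag()


correct = run_with_correct_names()
swapped = run_with_swapped_names()
```

With the swapped names the test still runs without errors, but each method returns the value that was meant for the other one, which is why the expected result can look inexplicable when reading the test.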

16Martin commented 4 days ago

Some functionality has been moved to protected methods in order to split the task into smaller, more focused parts. There is only one new public method: FindSources.version_to_github_tag().

The core of this PR is in https://github.com/sw360/capycli/blob/martin/fix-github-tag-matching/capycli/bom/findsources.py#L278-L290.

While the current approach first fetches all tags that belong to a specific project and then passes the full list to get_matching_tag(), this new approach calls get_matching_tag() for each tag on each page of tags. As a result, if the new tag guessing does not yield any matches, the algo will match (or not match) the same tag the current approach matches.

The most important additions are lines 290 and following. If get_matching_tag() was unable to match the version to the tag, the logic generates eight possible candidates from the version and then checks if any of these candidates exist in the repo, before moving on to the next tag in the list.

The logic to create the candidates is (to some extent) the inverse of to_semver_string(). The idea is that if any of these candidates exists, get_matching_tag() would yield a positive match.
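
To give a rough idea of what candidate generation can look like, here is a small sketch. The function name guess_tag_candidates, the prefixes, and the separators are assumptions chosen for illustration; they are not the eight candidates the PR actually generates, which are defined in findsources.py.

```python
from __future__ import annotations

from itertools import product


def guess_tag_candidates(version: str) -> list[str]:
    """Build plausible tag names for a version (hypothetical rules)."""
    # Assumed naming conventions; the real candidate list may differ.
    prefixes = ("", "v", "release-", "release/")
    separators = (".", "_")

    parts = version.split(".")
    candidates: list[str] = []
    for prefix, separator in product(prefixes, separators):
        candidate = prefix + separator.join(parts)
        # Keep insertion order and skip duplicates (e.g. for version "1").
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


candidates = guess_tag_candidates("1.0.1")
```

For a three-part version such as "1.0.1" this sketch happens to produce eight distinct names, for example "1.0.1", "v1.0.1" and "release-1_0_1".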

The algo then looks for each candidate in the current result-page and if that local lookup does not yield a match, then the algo queries the GitHub API and specifically asks if a tag with the candidate's name exists. If we can find a match through either of these two lookups, we use that match and stop the search.
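
The two lookups can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: find_candidate and tag_exists_remotely are hypothetical names, the remote check is injected as a callable so the sketch runs without network access, and trying all candidates locally before issuing any remote query is one possible ordering.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def find_candidate(
    candidates: Iterable[str],
    page_tags: Iterable[str],
    tag_exists_remotely: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate found locally or remotely, else None."""
    candidate_list = list(candidates)
    page = set(page_tags)

    # Lookup 1: check the tags already fetched with the current results page.
    for candidate in candidate_list:
        if candidate in page:
            return candidate

    # Lookup 2: ask the remote side whether a tag with that exact name exists.
    for candidate in candidate_list:
        if tag_exists_remotely(candidate):
            return candidate

    return None


# Fake remote check for demonstration; a real one would wrap an API request.
remote_tags = {"release-2.3.0"}
remote_calls: list[str] = []


def fake_remote(tag: str) -> bool:
    remote_calls.append(tag)
    return tag in remote_tags


local_hit = find_candidate(["1.0.1", "v1.0.1"], ["v1.0.0", "v1.0.1"], fake_remote)
calls_after_local_hit = list(remote_calls)
remote_hit = find_candidate(["2.3.0", "release-2.3.0"], ["v1.0.0"], fake_remote)
```

A local hit costs no additional request, which is consistent with the search stopping as soon as either lookup succeeds.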

With my BOMs, I notice a tremendous speedup. On average, the guessing part finds a positive match immediately on the first results page. If it doesn't, the API query is successful. Using my BOMs, the algo never fetches the second page of tags from GitHub.