Component version to GitHub tag matching.

16Martin commented 2 weeks ago

I have been experiencing issues with capycli's GitHub interaction in bom findsources. Jobs take unexpectedly long, and memory consumption is correspondingly high (though that isn't an issue in itself). I use capycli to process relatively large BOMs and, according to capycli's findings, these frequently contain 400-500 third-party components from GitHub.

I tracked these issues to how findsources maps component versions to tags on GitHub. Currently, capycli first retrieves the full list of a project's tags (get_github_info() in capycli.bom.findsources) and then iterates over this list, hoping to find a match to the version provided as a parameter to get_matching_tag().

There are projects like the tencentcloud sdk with tens of thousands of tags. Using the GitHub API, capycli has to retrieve these in chunks of 100 tags per call using Python's synchronous IO.

On average, get_matching_tag() does 109 negative comparisons for each tag it matches. This means that, in my use cases, capycli has to fetch two pages' worth of tags on average to match a component. This amounts to retrieving tencentcloud sdk alone.
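The page arithmetic behind these numbers can be checked with a small sketch. The page size of 100 and the 109 comparisons come from this thread; the function name pages_needed and the 20,000-tag repository are assumptions made only for illustration.

```python
import math

PER_PAGE = 100  # tags per API call, as stated above


def pages_needed(tags_examined: int, per_page: int = PER_PAGE) -> int:
    """Number of API calls needed to see `tags_examined` tags."""
    return math.ceil(tags_examined / per_page)


# 109 non-matching tags plus the one tag that finally matches
print(pages_needed(109 + 1))  # 2

# hypothetical repository with 20,000 tags that is fetched in full
print(pages_needed(20_000))  # 200
```

If these assumptions hold, a match needs only two calls on average, while fetching every tag first costs one call per 100 tags regardless of where the match sits.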

As far as I can tell, ...

get_github_info() is only ever used twice, with both occurrences in capycli.bom.findsources. Both uses feed almost directly into get_matching_tag().
get_matching_tag() is only ever used three times, with all occurrences in capycli.bom.findsources. In all three, the result is essentially returned immediately.

Are there any uses of these methods I missed?

16Martin commented 2 weeks ago

I am working on a new implementation that replaces get_github_info() and get_matching_tag().

While get_github_info() used to fetch all the tags and get_matching_tag() would search the full result set, the new approach joins these two methods and searches page by page as they are retrieved from the GitHub API.
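A minimal sketch of the page-by-page idea, not the actual implementation: the names iter_tag_pages, find_tag, fake_fetch and strip_v are hypothetical, and the page fetcher is passed in as a plain callable so the example runs without network access. strip_v only stands in for whatever maps a tag to a version.

```python
from typing import Callable, Dict, Iterator, List, Optional

Page = List[str]


def iter_tag_pages(fetch_page: Callable[[int], Page]) -> Iterator[Page]:
    """Yield pages of tag names lazily until an empty page comes back."""
    page_no = 1
    while True:
        page = fetch_page(page_no)
        if not page:
            return
        yield page
        page_no += 1


def find_tag(version: str,
             fetch_page: Callable[[int], Page],
             tag_to_version: Callable[[str], str]) -> Optional[str]:
    """Search each page as it arrives and stop at the first match."""
    for page in iter_tag_pages(fetch_page):
        for tag in page:
            if tag_to_version(tag) == version:
                return tag  # later pages are never requested
    return None


# Fake data standing in for the GitHub API (assumption for this sketch).
PAGES: Dict[int, Page] = {1: ["v3.0.0", "v2.0.0"], 2: ["v1.0.0"]}
calls: List[int] = []


def fake_fetch(page_no: int) -> Page:
    calls.append(page_no)  # record which pages were requested
    return PAGES.get(page_no, [])


def strip_v(tag: str) -> str:
    return tag.lstrip("v")


print(find_tag("2.0.0", fake_fetch, strip_v))  # v2.0.0
print(calls)  # [1]
```

Because the generator only requests the next page when the loop asks for it, a match on the first page costs a single call.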

The current implementation is based on the assumption that for each release version there is a corresponding tag in the repository and that the release version is encoded in the tag's name, from which it can be retrieved using to_semver_string(). The new approach builds on that assumption even further and tries to guess the correct tag name from a tag that corresponds to a non-matching version.

Based on the assumption that projects follow a scheme for tags that encodes a semantic (or semver-like) version into a tag, the new approach uses an inverted implementation of to_semver_string(). Instead of inferring a semver from a tag, this inverse implementation will infer a tag from a semver after analysing a tag retrieved from GitHub.
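One way such an inversion could look is sketched below. This is only an illustration under a strong assumption: the version string appears verbatim inside the tag name. The function guess_tag and its parameters are hypothetical and are not part of capycli.

```python
from typing import Optional


def guess_tag(sample_tag: str, sample_version: str,
              wanted_version: str) -> Optional[str]:
    """Infer a tag name for `wanted_version` from one known (tag, version) pair.

    Returns None when the sample tag does not contain its version verbatim,
    so the caller can fall back to another strategy.
    """
    start = sample_tag.rfind(sample_version)
    if start == -1:
        return None  # naming scheme not recognised
    prefix = sample_tag[:start]
    suffix = sample_tag[start + len(sample_version):]
    return prefix + wanted_version + suffix


print(guess_tag("v1.2.3", "1.2.3", "1.4.0"))         # v1.4.0
print(guess_tag("release-2.0.1", "2.0.1", "2.1.0"))  # release-2.1.0
print(guess_tag("rel_1_2_3", "1.2.3", "1.4.0"))      # None
```

A guessed name is only a candidate. It would still have to be confirmed against the repository, and schemes that rewrite the version (as in the third call) are not covered by this sketch.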

tngraf commented 1 week ago

Sounds good so far. Please ensure that the old implementation still works as a fallback in case your new, faster approach does not work.
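The requested fallback could be wired up roughly as follows. This is a sketch only: match_with_fallback, fast_guess and full_scan are hypothetical names, and the two strategies are stubbed with fixed data so the example runs on its own.

```python
from typing import Callable, Optional

Matcher = Callable[[str], Optional[str]]


def match_with_fallback(version: str, fast: Matcher, slow: Matcher) -> Optional[str]:
    """Try the fast strategy first and use the slow one only if it finds nothing."""
    tag = fast(version)
    if tag is not None:
        return tag
    return slow(version)


# Stub strategies (assumptions for illustration only).
def fast_guess(version: str) -> Optional[str]:
    # pretend the guess only succeeds for one known version
    return "v1.4.0" if version == "1.4.0" else None


def full_scan(version: str) -> Optional[str]:
    # pretend this is the old behaviour: search every known tag
    known = {"1.4.0": "v1.4.0", "0.9.0": "rel_0_9_0"}
    return known.get(version)


print(match_with_fallback("1.4.0", fast_guess, full_scan))  # v1.4.0
print(match_with_fallback("0.9.0", fast_guess, full_scan))  # rel_0_9_0
print(match_with_fallback("7.7.7", fast_guess, full_scan))  # None
```

With this shape, the old behaviour stays reachable unchanged, and the faster path only ever saves work when it succeeds.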

sw360 / capycli

Component version to GitHub tag matching. #99