perf(license): tf-idf based matching

Context

Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:

It depends on compiled packages that are not available on arm64 (e.g. newer MacBooks)
It is slow (2.7 seconds to match an Apache-2.0 license)
It has many dependencies

Proposal

This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:

Tokenize input license
Compute tf-idf vector
Compute cosine similarity against pre-computed tf-idf vectors of SPDX licenses
Pick the license with the highest similarity if it is above a (conservative) similarity threshold
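
The four steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: the tokenizer regex, the smoothed idf formula, the `0.9` threshold, the toy corpus and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1 for tok, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse, L2-normalised tf-idf vector; out-of-vocabulary tokens are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    vec = {tok: n * idf[tok] for tok, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-norm sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def match_license(text, idf, reference_vectors, threshold=0.9):
    """Return (license_id, similarity); license_id is None below the threshold."""
    query = tfidf_vector(text, idf)
    best_id, best_sim = None, 0.0
    for license_id, vec in reference_vectors.items():
        sim = cosine(query, vec)
        if sim > best_sim:
            best_id, best_sim = license_id, sim
    return (best_id, best_sim) if best_sim >= threshold else (None, best_sim)


# Toy stand-ins for the SPDX texts (invented for this example).
CORPUS = {
    "toy-permissive": "permission is hereby granted free of charge to any person obtaining a copy",
    "toy-copyleft": "you may redistribute this program under the terms of this license and share source",
}
IDF = fit_idf(list(CORPUS.values()))
REFERENCE = {name: tfidf_vector(text, IDF) for name, text in CORPUS.items()}

print(match_license(CORPUS["toy-permissive"], IDF, REFERENCE))
```

Because the reference vectors are normalised once, matching an input costs one tokenization pass plus a dot product per known license.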

This implies:

We need to ship a matrix of pre-computed tf-idf vectors and a fitted tf-idf vectorizer with the package, making it a bit heavier
We need a script to re-compute these vectors

Visual representation of TFIDF

Changes

This PR implements 3 elements:

A tf-idf vectorizer that can be serialized to and parsed from JSON (in gimie.utils.text)
A script to download SPDX licenses and regenerate the pre-computed files (in scripts/generate_tfidf.py)
An adapted LicenseParser that uses this tf-idf vectorizer (in gimie.parsers.license)

It also:

Updates dependencies (-scancode, +pydantic, +scipy)
Updates the supported Python versions from 3.8-3.11 to 3.9-3.12
Embeds the pre-computed tf-idf vectors and fitted vectorizer in gimie/parsers/license/data
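
To illustrate what a JSON-serializable vectorizer involves, here is a rough sketch using only a dataclass and the `json` module. It is not the class shipped in gimie.utils.text (which may well rely on pydantic); the name `TinyTfidf`, its fields and its idf formula are hypothetical.

```python
import json
import math
import re
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class TinyTfidf:
    """Hypothetical stand-in for a vectorizer whose fitted state is plain JSON."""

    vocabulary: dict = field(default_factory=dict)  # token -> column index
    idf: list = field(default_factory=list)  # idf weight per column

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def fit(self, docs):
        """Learn the vocabulary and smoothed idf weights from a list of texts."""
        doc_freq = Counter(tok for doc in docs for tok in set(self.tokenize(doc)))
        self.vocabulary = {tok: i for i, tok in enumerate(sorted(doc_freq))}
        n_docs = len(docs)
        self.idf = [0.0] * len(self.vocabulary)
        for tok, i in self.vocabulary.items():
            self.idf[i] = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
        return self

    def transform(self, text):
        """Dense, L2-normalised tf-idf vector for a single document."""
        vec = [0.0] * len(self.vocabulary)
        for tok, count in Counter(self.tokenize(text)).items():
            i = self.vocabulary.get(tok)
            if i is not None:
                vec[i] = count * self.idf[i]
        norm = math.sqrt(sum(w * w for w in vec))
        return [w / norm for w in vec] if norm else vec

    def to_json(self):
        """Dump the fitted state; only dicts, lists, strings and floats are involved."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload):
        """Rebuild a fitted vectorizer without executing any serialized code."""
        return cls(**json.loads(payload))


fitted = TinyTfidf().fit(["mit license text", "gpl license text with more words"])
restored = TinyTfidf.from_json(fitted.to_json())
```

Since the fitted state is only a vocabulary and a list of weights, the serialized file stays small and loading it does not carry the code-execution risk of pickle.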

Alternative solution

Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie. The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).

Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:

|method |file size |deserialization time |
|:------------------------|---------:|--------------------:|
|custom-tfidf |24 kB |0.43 ms |
|sklearn+skops |7.8 MB |223 ms |
|sklearn+skops+zip-level9 |564 kB |232 ms |

Accuracy

Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code source dataset. The numbers are not exact for the following reasons:

Some repositories have multiple licenses; only one arbitrary license file was considered here.
When a license file contains multiple concatenated licenses, the GitHub API sometimes predicts only the first one (instead of returning NOASSERTION).

full results table: tfidf_predictions_pwc.csv

When comparing the matched licenses against the GitHub API results (excluding repositories where GitHub failed to identify the license), we get 97.2% accuracy.
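
For reference, the accuracy figure can be reproduced along these lines. This is a sketch under assumptions: it supposes the results CSV exposes the `license_github` and `tfidf_pred` columns shown in the detailed results, and that a failed GitHub identification appears as `NOASSERTION` or an empty value. The rows below are invented toy data, not taken from tfidf_predictions_pwc.csv.

```python
import csv
import io


def accuracy_vs_github(rows, unknown=("", "NOASSERTION")):
    """Share of rows where tf-idf agrees with GitHub, skipping rows GitHub could not label."""
    compared = [r for r in rows if r["license_github"] not in unknown]
    if not compared:
        return float("nan")
    hits = sum(r["tfidf_pred"] == r["license_github"] for r in compared)
    return hits / len(compared)


# Invented toy rows; the real file would be opened with open(path, newline="").
TOY_CSV = """url,license_github,tfidf_pred
https://example.org/a,MIT,MIT
https://example.org/b,GPL-3.0,AGPL-3.0
https://example.org/c,NOASSERTION,Apache-2.0
https://example.org/d,Apache-2.0,Apache-2.0
"""
rows = list(csv.DictReader(io.StringIO(TOY_CSV)))
print(f"accuracy: {accuracy_vs_github(rows):.3f}")
```

Here the GitHub label is treated as the reference, so any GitHub misprediction (for example on concatenated license files) counts against the tf-idf matcher.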

detailed results

Confusion matrix on the most common licenses:

![image](https://github.com/SDSC-ORD/gimie/assets/22558602/cc39fc92-23a7-431e-a231-2884f1c9779e)

And the repositories for which the license was confidently assigned differently than GitHub (most have 2 or more licenses):

|url |license_github |tfidf_pred | tfidf_cosine_similarity|
|:---------------------------------------------|:--------------|:------------|-----------------------:|
|https://github.com/HAWinther/MG-PICOLA-PUBLIC |GPL-2.0 |GPL-3.0 | 0.9799652|
|https://github.com/zhongliliu/elastool |GPL-3.0 |GPL-2.0 | 0.9751945|
|https://github.com/jerichooconnell/fastCAT |GPL-3.0 |AGPL-3.0 | 0.9806217|
|https://github.com/SWIFTSIM/swiftsimio |LGPL-3.0 |GPL-3.0 | 0.9799652|
|https://github.com/wenjiedu/brewpots |GPL-3.0 |BSD-3-Clause | 0.9227585|
|https://github.com/nilesh2797/zestxml |BSD-3-Clause |BSD-2-Clause | 0.9029458|
|https://github.com/bgris/odl |MPL-2.0 |OSET-PL-2.1 | 0.9216744|
|https://github.com/marco-oliva/afm |MIT |GPL-3.0 | 0.9799652|
|https://github.com/jsl03/apricot |GPL-3.0 |AGPL-3.0 | 0.9805915|
|https://github.com/jakobrunge/tigramite |GPL-3.0 |AGPL-3.0 | 0.9806152|

Questions

Can we tolerate a small margin of error when attributing licenses? We can adjust the threshold if needed.
Do we prefer the faster and lighter custom implementation, or reducing the amount of code by using scikit-learn?
- Tradeoff: 0.23s and 5kb vs 173 lines of code (+150 lines of comments)
