Context
Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:
- It depends on compiled packages that are not available on arm64 (e.g. newer MacBooks)
- It is slow (2.7 seconds to match an Apache-2.0 license)
- It has many dependencies
Proposal
This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:
1. Tokenize the input license
2. Compute its tf-idf vector
3. Compute the cosine similarity against pre-computed tf-idf vectors of SPDX licenses
4. Pick the license with the highest similarity if it is above a (conservative) similarity threshold
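
The four steps above can be sketched in plain Python. This is only an illustration, not the code in this PR: the reference texts, the `match_license` helper, the log-scaled IDF and the 0.9 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for the SPDX reference texts (heavily shortened).
REFERENCE_TEXTS = {
    "MIT": "permission is hereby granted free of charge to any person obtaining a copy",
    "Apache-2.0": "licensed under the apache license version 2 you may not use this file",
    "GPL-3.0": "this program is free software you can redistribute it under the gnu general public license",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def fit_idf(docs):
    """Weight each term by how rare it is across the reference documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    # Assumption: log-scaled IDF; the real vectorizer may weight terms differently.
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf(text, idf):
    """Sparse tf-idf vector (term -> weight); terms unknown to the corpus are dropped."""
    counts = Counter(tokenize(text))
    return {term: n * idf[term] for term, n in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


# Pre-computed once; in the PR the equivalent data ships with the package.
IDF = fit_idf(REFERENCE_TEXTS.values())
VECTORS = {spdx_id: tfidf(text, IDF) for spdx_id, text in REFERENCE_TEXTS.items()}


def match_license(text, threshold=0.9):
    """Return the best-matching SPDX id, or None if the match is not confident enough."""
    query = tfidf(text, IDF)
    best_id, best_sim = max(
        ((spdx_id, cosine(query, vec)) for spdx_id, vec in VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return best_id if best_sim >= threshold else None
```

Returning `None` below the threshold mirrors the conservative behaviour described in step 4: an uncertain match is reported as no match rather than as a guess.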
This implies:
- We need to ship a matrix of pre-computed tf-idf vectors and a fitted tf-idf vectorizer with the package, making it a bit heavier
- We need a script to re-compute these vectors
Visual representation of TF-IDF
The process of computing TF-IDF vectors is illustrated below, with a corpus of 2 documents containing a single sentence each.
```mermaid
graph TD
subgraph Corpus
D1[The GPL3 license]
D2[The MIT license]
C1["the, gpl3, license"]
C2["the, mit, license"]
end
subgraph "Term-Frequency Matrix"
F1["the: 1, gpl3: 1, license: 1"]
F2["the: 1, mit: 1, license: 1"]
TF["`TF (n_docs x n_terms)`"]
end
subgraph "Inverse Document Frequency Vector"
IDF["IDF (1 x n_terms)"]
end
subgraph "TF-IDF matrix"
TFIDF[TF-IDF]
end
D1 -->|tokenization| C1
D2 -->|tokenization| C2
C1 -->|counts| F1
C2 -->|counts| F2
F1 -->|build matrix| TF
F2 -->|build matrix| TF
TF -->|1 / proportion of documents containing term| IDF
TF -->|multiply| TFIDF
IDF -->|multiply| TFIDF
```
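
As a worked example, the same two-document corpus can be computed by hand in a few lines of Python. The sketch follows the diagram literally (IDF = 1 / proportion of documents containing the term, no log scaling or normalization), which may differ from the weighting used by the actual vectorizer.

```python
from collections import Counter

docs = ["The GPL3 license", "The MIT license"]

# Tokenization: lowercase and split on whitespace.
tokens = [doc.lower().split() for doc in docs]

# Vocabulary in a fixed (alphabetical) column order.
terms = sorted({term for doc in tokens for term in doc})

# Term-frequency matrix (n_docs x n_terms).
tf = [[Counter(doc)[term] for term in terms] for doc in tokens]

# IDF vector (1 x n_terms): n_docs / number of documents containing the term.
n_docs = len(docs)
idf = [n_docs / sum(1 for row in tf if row[j] > 0) for j in range(len(terms))]

# TF-IDF matrix: each column of TF is scaled by the matching IDF weight.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

print(terms)   # ['gpl3', 'license', 'mit', 'the']
print(idf)     # [2.0, 1.0, 2.0, 1.0]
print(tfidf)   # [[2.0, 1.0, 0.0, 1.0], [0.0, 1.0, 2.0, 1.0]]
```

The terms shared by both documents ("the", "license") get the lowest weight, while the distinguishing terms ("gpl3", "mit") are weighted up.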
Changes
This PR implements 3 elements:
- A tf-idf vectorizer that can be serialized to / parsed from JSON (in `gimie.utils.text`)
- A script to download SPDX licenses and regenerate the pre-computed files (in `scripts/generate_tfidf.py`)
- An adapted `LicenseParser` that uses this tf-idf vectorizer (in `gimie.parsers.license`)

It also:
- Updates dependencies (-scancode, +pydantic, +scipy)
- Updates the supported python versions from 3.8-3.11 to 3.9-3.12
- Embeds the pre-computed tf-idf vectors and fitted vectorizer in `gimie/parsers/license/data`
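
To illustrate why a JSON-serializable vectorizer stays small and safe to load, here is a minimal sketch of the round trip. `TinyTfidfVectorizer` is a hypothetical class written for this example with standard-library dataclasses; it is not the class in `gimie.utils.text`, whose fields and methods may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TinyTfidfVectorizer:
    """Hypothetical fitted vectorizer: only a vocabulary and one IDF weight per term."""

    vocabulary: dict[str, int]  # term -> column index
    idf: list[float]            # IDF weight for each column

    def to_json(self) -> str:
        # The fitted state is plain data, so JSON is enough (no pickle needed).
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "TinyTfidfVectorizer":
        return cls(**json.loads(payload))

    def transform(self, tokens: list[str]) -> list[float]:
        """Dense tf-idf vector for a tokenized document; unknown tokens are ignored."""
        vector = [0.0] * len(self.idf)
        for token in tokens:
            index = self.vocabulary.get(token)
            if index is not None:
                vector[index] += self.idf[index]
        return vector


fitted = TinyTfidfVectorizer(
    vocabulary={"gpl3": 0, "license": 1, "mit": 2, "the": 3},
    idf=[2.0, 1.0, 2.0, 1.0],
)
restored = TinyTfidfVectorizer.from_json(fitted.to_json())
```

Because loading JSON only produces plain data and never executes code, this avoids the security concerns associated with pickle.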
Alternative solution
Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie.
The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).
Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:

| method | file size | deserialization time |
|:---|---:|---:|
| custom-tfidf | 24 kB | 0.43 ms |
| sklearn+skops | 7.8 MB | 223 ms |
| sklearn+skops+zip-level9 | 564 kB | 232 ms |

Accuracy
Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code source dataset (link). The numbers are not exact for the following reasons:
- Some repositories have multiple licenses; only one arbitrarily chosen license file was considered here.
- When a license file contains multiple concatenated licenses, the GitHub API sometimes predicts only the first one (instead of returning NOASSERTION).
Full results table: tfidf_predictions_pwc.csv
When comparing the matched licenses against the GitHub API results (excluding those where GitHub failed to identify the license), we get 97.2% accuracy.
Detailed results
Confusion matrix on the most common licenses:

![image](https://github.com/SDSC-ORD/gimie/assets/22558602/cc39fc92-23a7-431e-a231-2884f1c9779e)

And the repositories for which the license was confidently assigned differently than GitHub (most have 2 or more licenses):

|url |license_github |tfidf_pred | tfidf_cosine_similarity|
|:---------------------------------------------|:--------------|:------------|-----------------------:|
|https://github.com/HAWinther/MG-PICOLA-PUBLIC |GPL-2.0 |GPL-3.0 | 0.9799652|
|https://github.com/zhongliliu/elastool |GPL-3.0 |GPL-2.0 | 0.9751945|
|https://github.com/jerichooconnell/fastCAT |GPL-3.0 |AGPL-3.0 | 0.9806217|
|https://github.com/SWIFTSIM/swiftsimio |LGPL-3.0 |GPL-3.0 | 0.9799652|
|https://github.com/wenjiedu/brewpots |GPL-3.0 |BSD-3-Clause | 0.9227585|
|https://github.com/nilesh2797/zestxml |BSD-3-Clause |BSD-2-Clause | 0.9029458|
|https://github.com/bgris/odl |MPL-2.0 |OSET-PL-2.1 | 0.9216744|
|https://github.com/marco-oliva/afm |MIT |GPL-3.0 | 0.9799652|
|https://github.com/jsl03/apricot |GPL-3.0 |AGPL-3.0 | 0.9805915|
|https://github.com/jakobrunge/tigramite |GPL-3.0 |AGPL-3.0 | 0.9806152|

Questions