sdsc-ordes / gimie

Extract linked metadata from repositories
https://sdsc-ordes.github.io/gimie/
Apache License 2.0

perf(license): tf-idf based matching #99

Closed cmdoret closed 10 months ago

cmdoret commented 10 months ago

Context

Up to now, license matching was done using the scancode-toolkit package, which has the following drawbacks:

Proposal

This PR replaces the rule-based scancode matcher with a probabilistic matcher based on Term Frequency-Inverse Document Frequency (TF-IDF). Given an input license, this implementation works as follows:

  1. Tokenize input license
  2. Compute tf-idf vector
  3. Compute cosine similarity against pre-computed tf-idf vectors of SPDX licenses
  4. Pick the license with the highest similarity if it is above a (conservative) similarity threshold
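
The four steps above can be sketched in plain Python. This is a minimal illustration and not the code from this PR: the tokenizer regex, the smoothed IDF formula, the toy reference texts, the function names and the 0.9 threshold are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional

Vector = Dict[str, float]


def tokenize(text: str) -> List[str]:
    """Step 1: lowercase the text and keep alphanumeric tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs: List[List[str]]) -> Vector:
    """Smoothed inverse document frequency for each term of the corpus (assumed formula)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Vector) -> Vector:
    """Step 2: term counts weighted by IDF; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: n * idf[t] for t, n in counts.items()}


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Step 3: cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_license(
    text: str, idf: Vector, references: Dict[str, Vector], threshold: float = 0.9
) -> Optional[str]:
    """Step 4: return the best-scoring license id, or None below the threshold."""
    if not references:
        return None
    query = tfidf_vector(tokenize(text), idf)
    scores = {lid: cosine_similarity(query, vec) for lid, vec in references.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# Toy stand-ins for the SPDX license texts (hypothetical, heavily shortened).
REFERENCE_TEXTS = {
    "MIT": "Permission is hereby granted, free of charge, to any person",
    "GPL-3.0": "This program is free software: you can redistribute it",
}
tokenized = {lid: tokenize(txt) for lid, txt in REFERENCE_TEXTS.items()}
idf = fit_idf(list(tokenized.values()))
# Reference vectors are computed once and reused for every query.
references = {lid: tfidf_vector(toks, idf) for lid, toks in tokenized.items()}

print(match_license("Permission is hereby granted, free of charge, to any person", idf, references))
print(match_license("All rights reserved.", idf, references))
```

Returning `None` below the threshold reflects the conservative choice described in step 4: an unknown license is left unmatched rather than assigned to its nearest neighbour.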

This implies:

Visual representation of TFIDF

The process of computing TF-IDF vectors is illustrated below, with a corpus of 2 documents containing a single sentence each.

```mermaid
graph TD
    subgraph Corpus
        D1[The GPL3 license]
        D2[The MIT license]
        C1["the, gpl3, license"]
        C2["the, mit, license"]
    end
    subgraph "Term-Frequency Matrix"
        F1["the: 1, gpl3: 1, license: 1"]
        F2["the: 1, mit: 1, license: 1"]
        TF["`TF (n_docs x n_terms)`"]
    end
    subgraph "Inverse Document Frequency Vector"
        IDF["IDF (1 x n_terms)"]
    end
    subgraph "TF-IDF matrix"
        TFIDF[TF-IDF]
    end
    D1 -->|tokenization| C1
    D2 -->|tokenization| C2
    C1 -->|counts| F1
    C2 -->|counts| F2
    F1 -->|build matrix| TF
    F2 -->|build matrix| TF
    TF -->|1 / Proportion of document containing term| IDF
    TF -->|multiply| TFIDF
    IDF -->|multiply| TFIDF
```
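
The same two-document corpus can be worked through numerically. The sketch below follows the diagram's labels literally, taking IDF as 1 / (proportion of documents containing the term); this is an assumption for illustration, and real implementations often apply a logarithm and smoothing on top of it. Whitespace tokenization and the variable names are also illustrative.

```python
corpus = ["The GPL3 license", "The MIT license"]

# Tokenization: lowercase and split on whitespace.
docs = [doc.lower().split() for doc in corpus]

# Vocabulary, sorted so that the matrix columns have a stable order.
terms = sorted({term for doc in docs for term in doc})

# Term-frequency matrix, shape (n_docs x n_terms).
tf = [[doc.count(term) for term in terms] for doc in docs]

# IDF vector, shape (1 x n_terms): n_docs / number of documents containing the term.
n_docs = len(docs)
idf = [n_docs / sum(1 for doc in docs if term in doc) for term in terms]

# TF-IDF matrix: each TF row multiplied element-wise by the IDF vector.
tfidf = [[count * weight for count, weight in zip(row, idf)] for row in tf]

print(terms)
for row in tfidf:
    print(row)
```

Terms shared by both documents ("the", "license") get a weight of 1, while the terms that distinguish the licenses ("gpl3", "mit") get a weight of 2, which is why they dominate the similarity computation.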

Changes

This PR implements 3 elements:

It also:

Alternative solution

Implementing and testing a tf-idf vectorizer might be considered outside the scope of gimie. The branch refactor/sklearn-tfidf drops the custom TfidfVectorizer and instead imports the scikit-learn implementation and uses skops to securely serialize / parse it (instead of pickle, which has security issues).

Both implementations yield the same results, but the serialized TfidfVectorizer from scikit-learn is much larger and slower to deserialize:

| method | file size | deserialization time |
|:-------|----------:|---------------------:|
| custom-tfidf | 24 kB | 0.43 ms |
| sklearn+skops | 7.8 MB | 223 ms |
| sklearn+skops+zip-level9 | 564 kB | 232 ms |

Accuracy

Below are metrics computed on a sample of 2443 repositories from the paperswithcode links-between-papers-and-code dataset. The numbers are not exact for the following reasons:

Full results table: tfidf_predictions_pwc.csv

Comparing the TF-IDF matches against the GitHub API results (excluding repositories where GitHub failed to identify the license) gives 97.2% accuracy.
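
An accuracy figure of this kind can be computed roughly as follows. The column names are taken from the results table in this PR, but the rows here are made-up placeholders (not data from the evaluation), and encoding "GitHub failed to identify the license" as an empty `license_github` field is an assumption about the CSV.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample rows, for illustration only.
SAMPLE = """url,license_github,tfidf_pred,tfidf_cosine_similarity
https://example.org/a,MIT,MIT,0.99
https://example.org/b,GPL-3.0,AGPL-3.0,0.98
https://example.org/c,,MIT,0.97
https://example.org/d,Apache-2.0,Apache-2.0,0.99
"""


def accuracy(rows: List[Dict[str, str]]) -> float:
    """Share of rows where both sources agree, skipping rows without a GitHub license."""
    compared = [row for row in rows if row["license_github"]]
    if not compared:
        return float("nan")
    hits = sum(row["license_github"] == row["tfidf_pred"] for row in compared)
    return hits / len(compared)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(f"{accuracy(rows):.1%}")
```

Note that this treats the GitHub API result as ground truth, so a disagreement counts as an error even when the repository contains several licenses.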

Detailed results

Confusion matrix on the most common licenses:

![image](https://github.com/SDSC-ORD/gimie/assets/22558602/cc39fc92-23a7-431e-a231-2884f1c9779e)

And the repositories for which the license was confidently assigned differently than GitHub (most have 2 or more licenses):

| url | license_github | tfidf_pred | tfidf_cosine_similarity |
|:----|:---------------|:-----------|------------------------:|
| https://github.com/HAWinther/MG-PICOLA-PUBLIC | GPL-2.0 | GPL-3.0 | 0.9799652 |
| https://github.com/zhongliliu/elastool | GPL-3.0 | GPL-2.0 | 0.9751945 |
| https://github.com/jerichooconnell/fastCAT | GPL-3.0 | AGPL-3.0 | 0.9806217 |
| https://github.com/SWIFTSIM/swiftsimio | LGPL-3.0 | GPL-3.0 | 0.9799652 |
| https://github.com/wenjiedu/brewpots | GPL-3.0 | BSD-3-Clause | 0.9227585 |
| https://github.com/nilesh2797/zestxml | BSD-3-Clause | BSD-2-Clause | 0.9029458 |
| https://github.com/bgris/odl | MPL-2.0 | OSET-PL-2.1 | 0.9216744 |
| https://github.com/marco-oliva/afm | MIT | GPL-3.0 | 0.9799652 |
| https://github.com/jsl03/apricot | GPL-3.0 | AGPL-3.0 | 0.9805915 |
| https://github.com/jakobrunge/tigramite | GPL-3.0 | AGPL-3.0 | 0.9806152 |

Questions