Skeleton example:
Corpus downloader:
```python
import json

import requests

# Fetch the SPDX license index, then the full text of each license
resp = requests.get("https://raw.githubusercontent.com/spdx/license-list-data/main/json/licenses.json")
all_licenses = resp.json()
for lic in all_licenses["licenses"]:
    r = requests.get(lic["detailsUrl"])
    lic["text"] = r.json()["licenseText"]

# Save the enriched license list for the vectorizer step
with open("licenses.json", "w") as fp:
    json.dump(all_licenses, fp)
```
Vectorizer:
```python
import json
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

with open("licenses.json") as fp:
    all_licenses = json.load(fp)

# Fit TF-IDF on the full license corpus
corpus = [lic["text"] for lic in all_licenses["licenses"]]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Persist both the fitted vectorizer and the corpus matrix
with open("vectorizer.pickle", "wb") as fp:
    pickle.dump(vectorizer, fp)
with open("tfidf.pickle", "wb") as fp:
    pickle.dump(tfidf, fp)
```
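At inference time, the pickled artifacts can be loaded back to vectorize an unknown license text (a minimal sketch; file names match the script above):

```python
import pickle

with open("vectorizer.pickle", "rb") as fp:
    vectorizer = pickle.load(fp)
with open("tfidf.pickle", "rb") as fp:
    tfidf = pickle.load(fp)

# Project a new text into the same TF-IDF space as the corpus
query = vectorizer.transform(["Permission is hereby granted, free of charge, ..."])
```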
Note: since downloading the corpus and building the vectorizer is a one-time job (not performed by users), speed is not crucial. We may want to implement the vectorizer ourselves to avoid depending on sklearn and to avoid storing unneeded metadata in the pickled vectorizer.
Resources:
Had some fun implementing a pure-Python, JSON-serializable TfidfVectorizer. It supports a subset of the parameters of the sklearn implementation and gives identical results. It is slower, but inference for a license takes ~0.2 s.
https://gist.github.com/cmdoret/4ea255e8adb398938f9d5114a4dfd373
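For illustration, here is a minimal sketch of the idea (not the gist's actual code), assuming sklearn's defaults: lowercasing, the `\b\w\w+\b` token pattern, smoothed IDF, and L2 normalization:

```python
import json
import math
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # sklearn's default token_pattern

class TinyTfidfVectorizer:
    """Pure-Python, JSON-serializable TF-IDF mirroring sklearn defaults."""

    def fit(self, corpus):
        docs = [Counter(TOKEN.findall(doc.lower())) for doc in corpus]
        terms = sorted({tok for doc in docs for tok in doc})
        self.vocab = {tok: i for i, tok in enumerate(terms)}
        n = len(docs)
        # smooth_idf=True: idf(t) = ln((1 + n) / (1 + df(t))) + 1
        self.idf = [
            math.log((1 + n) / (1 + sum(tok in doc for doc in docs))) + 1
            for tok in terms
        ]
        return self

    def transform(self, corpus):
        rows = []
        for doc in corpus:
            vec = [0.0] * len(self.vocab)
            for tok, tf in Counter(TOKEN.findall(doc.lower())).items():
                if tok in self.vocab:
                    vec[self.vocab[tok]] = tf * self.idf[self.vocab[tok]]
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            rows.append([x / norm for x in vec])  # L2-normalize each row
        return rows

    def to_json(self):
        # Only the vocabulary and IDF weights are needed to reload it
        return json.dumps({"vocab": self.vocab, "idf": self.idf})
```

Serializing only the vocabulary and IDF weights keeps the artifact small and avoids pickle entirely.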
scancode-toolkit imposes speed and platform limitations. As we only use the library for license matching, it is hard to justify imposing these limitations on gimie.
We could probably implement a license matcher using a rule-based, distance-based, or ML method.
Suggested approach (truncated): TF-IDF-based classification
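Concretely, matching could reduce to cosine similarity between the query vector and the corpus matrix; since sklearn's TF-IDF rows are L2-normalized by default, that is just a dot product. A sketch, assuming the pickled artifacts from above:

```python
import json
import pickle

with open("licenses.json") as fp:
    licenses = json.load(fp)["licenses"]
with open("vectorizer.pickle", "rb") as fp:
    vectorizer = pickle.load(fp)
with open("tfidf.pickle", "rb") as fp:
    tfidf = pickle.load(fp)

def match_license(text: str) -> str:
    """Return the SPDX id of the closest license in the corpus."""
    query = vectorizer.transform([text])
    # Rows are L2-normalized, so the dot product equals cosine similarity
    scores = (tfidf @ query.T).toarray().ravel()
    return licenses[scores.argmax()]["licenseId"]
```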
Requirements:
- `gimie.sources.common.license.get_license_url()` to use the vectorizer instead of scancode (see the sketch below)
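A hypothetical shape for that function, reusing `match_license` from the sketch above (the signature and URL pattern are illustrative, not gimie's actual API):

```python
def get_license_url(license_text: str) -> str:
    """Map a raw license text to its canonical SPDX URL (illustrative)."""
    spdx_id = match_license(license_text)  # TF-IDF matcher sketched above
    return f"https://spdx.org/licenses/{spdx_id}.html"
```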
Next (optional):
Credits: thanks @panaetius for the suggestion :)