The ESCO Playground is a repository for experimenting with the ESCO dataset and for testing different approaches to extracting skills from text.
:warning: This is a work in progress, and it is not ready for production.
To install the development version of the package, you can use pip:

```bash
pip install git+https://github.com/par-tec/esco-playground
```
Optional dependencies can be installed via:

```bash
pip install esco[langchain]
pip install esco[dev]
```
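To check that the installation worked, you can try importing the package (a minimal sanity check based on the `LocalDB` class shown below):

```bash
python -c "import esco; print(esco.LocalDB)"
```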
The simplest way to use this module is via the `LocalDB` class, which wraps the ESCO dataset embedded in the package as JSON files:
```python
import pandas

from esco import LocalDB

esco_data = LocalDB()

# Get a skill by its CURIE.
skill = esco_data.get("esco:b0096dc5-2e2d-4bc1-8172-05bf486c3968")

# Search skills using a set of labels.
skills = esco_data.search_products({"python", "java"})

# Further queries can be done using the embedded dataframe.
assert esco_data.skills.__class__ == pandas.core.frame.DataFrame
esco_data.skills[esco_data.skills.label == "SQL Server"]
```
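Since `esco_data.skills` is a plain pandas DataFrame, standard pandas operations work beyond exact-label matches; for example, a case-insensitive substring search (a sketch relying only on the `label` column shown above):

```python
# Case-insensitive substring match on the label column.
esco_data.skills[esco_data.skills.label.str.contains("sql", case=False)]
```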
To use extra features, such as text-to-skill extraction, you need to install the optional dependencies (note that extraction is quite slow without a GPU):

```bash
pip install esco[langchain]
```
Use the `EscoCV` and `Ner` classes to extract skills from text:
```python
from pathlib import Path

import nltk

from esco import LocalDB
from esco.cv import EscoCV
from esco.ner import Ner

# Initialize the vector index (slow) on disk.
# This can be reused later.
datadir = Path("/tmp/esco-tmpdir")
datadir.mkdir(exist_ok=True)
cfg = {
    "path": datadir / "esco-skills",
    "collection_name": "esco-skills",
}
db = LocalDB()
db.create_vector_idx(cfg)
db.close()

# Now you can create a new db that loads the vector index...
db = LocalDB(vector_idx_config=cfg)

# ...and a recognizer that uses both the ESCO dataset and the vector index.
# Note: nltk.sent_tokenize requires the "punkt" data, e.g. nltk.download("punkt").
cv_recognizer = Ner(db=db, tokenizer=nltk.sent_tokenize)

# Now you can use the recognizer to extract skills from text.
cv_text = """I am a software developer with 5 years of experience in Python and Java."""
cv = cv_recognizer(cv_text)

# This will take some time.
cv_skills = cv.skills()
```
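The exact shape of the entries returned by `skills()` may vary between package versions; assuming it returns an iterable, a quick way to inspect the results is:

```python
# Print each extracted skill entry; the entry structure is version-dependent.
for entry in cv_skills:
    print(entry)
```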
If you have a SPARQL server with the ESCO dataset, you can use the `SparqlClient`:
```python
from esco.sparql import SparqlClient

client = SparqlClient("http://localhost:8890/sparql")

skills_df = client.load_skills()
occupations_df = client.load_occupations()

# You can even run custom queries returning a CSV.
query = """SELECT ?skill ?label
WHERE {
  ?skill a esco:Skill .
  ?skill skos:prefLabel ?label .
  FILTER (lang(?label) = 'en')
}"""
skills = client.query(query)
```
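Assuming `load_skills()` and `load_occupations()` return pandas DataFrames (suggested by the `_df` naming above), you can persist them for offline use:

```python
# Save the loaded datasets locally (assumes pandas DataFrames).
skills_df.to_csv("skills.csv", index=False)
occupations_df.to_csv("occupations.csv", index=False)
```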
The Jupyter notebook should work without the ESCO dataset, since an excerpt of the dataset is already included in `esco.json.gz`.
To regenerate the NER model, you need the ESCO dataset in turtle format.
:warning: Before using this repository, you need to:

1. Download the ESCO 1.1.1 database in text/turtle format (`ESCO dataset - v1.1.1 - classification - - ttl.zip`) from the ESCO portal and unzip the `.ttl` file under the `vocabularies` folder.
2. Start the SPARQL server that will serve the ESCO dataset, and wait for it to spin up and load the ~700MB dataset. :warning: This takes a couple of minutes, so you need to wait for the server to be ready.

   ```bash
   docker-compose up -d virtuoso
   ```
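   A rough readiness check is to poll the SPARQL endpoint with an `ASK` query (the port below is an assumption, matching the `SparqlClient` example above):

   ```bash
   curl -sf "http://localhost:8890/sparql?query=ASK%20%7B%7D" && echo ready
   ```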
3. Run the tests using tox:

   ```bash
   tox -e py3
   ```

   or using the docker-compose file:

   ```bash
   docker compose up test
   ```
To regenerate the model, you need to set up the ESCO dataset as explained above and then run the following command:

```bash
tox -e model
```
To build and upload the model, provided you did `huggingface-cli login`:

```bash
tox -e model -- upload
```
## Contributing
Please see [CONTRIBUTING.md](CONTRIBUTING.md) for more details on:
- using [pre-commit](CONTRIBUTING.md#pre-commit);
- following the git flow and making good [pull requests](CONTRIBUTING.md#making-a-pr).
## Using this repository
You can create new projects starting from this repository, so that different projects share a consistent CI and set of checks.
Besides all the explanations in the [CONTRIBUTING.md](CONTRIBUTING.md) file, you can use the docker-compose file (e.g. if you prefer to use docker instead of installing the tools locally):

```bash
docker-compose run pre-commit
```
If you need a GPU server, you can create a new GPU machine using the pre-built `debian-11-py310` image. The command is roughly the following:
```bash
gcloud compute instances create instance-2 \
    --machine-type=n1-standard-4 \
    --create-disk=auto-delete=yes,boot=yes,device-name=instance-1,image=projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231209-debian-11-py310,mode=rw,size=80,type=projects/${PROJECT}/zones/europe-west1-b/diskTypes/pd-standard \
    --no-restart-on-failure \
    --maintenance-policy=TERMINATE \
    --provisioning-model=STANDARD \
    --accelerator=count=1,type=nvidia-tesla-t4 \
    --no-shielded-secure-boot \
    --shielded-vtpm \
    --shielded-integrity-monitoring \
    --labels=goog-ec-src=vm_add-gcloud \
    --reservation-affinity=any \
    --zone=europe-west1-b \
    ...
```
Access the machine and finalize the CUDA installation. Remember to enable port-forwarding for the Jupyter notebook:

```bash
gcloud compute ssh --zone "europe-west1-b" "deleteme-gpu-1" --project "esco-test" -- -NL 8081:localhost:8081
```
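On the remote machine, you can then start the notebook on the forwarded port (a sketch; the deep learning image may already run its own Jupyter service):

```bash
jupyter notebook --no-browser --port 8081
```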
Check out the project and install the requirements:

```bash
git clone https://github.com/par-tec/esco-playground.git
cd esco-playground
pip install -r requirements-dev.txt -r requirements.txt
```
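Finally, you can verify that the GPU and the CUDA drivers are visible with the standard NVIDIA tool:

```bash
nvidia-smi
```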