urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

GliNER for Text Classification #104

Open wjbmattingly opened 1 month ago

wjbmattingly commented 1 month ago

Hi all,

I have been working on a few separate packages attached to GLiNER, and one of them may be ready to share. It builds on GLiNER spaCy, and I could use advice on whether you think it would be worth building into GLiNER spaCy or packaging as a separate component. It is a new spaCy component that needs to be added after a GLiNER pipe. I am testing it on Holocaust-related material.

It works like this. A user defines a set of categories as keys, each with a list of GLiNER labels as its value. The user data would look like this:

{
    "family": ["child", "spouse", "marital"],
    "personal": ["person"],
    "locational": ["address", "office"],
    "health": ["mental health", "medical"],
    "labor": ["work", "job", "office"],
    "emotion": ["fear", "happy"],
    "education": ["school", "student"],
    "movement": ["travel", "from location", "to location", "leaving", "arriving"],
    "violence": ["violence"]
}

The goal here is to set GLiNER to a rather low threshold and use nested spans to capture greater nuance. The new gliner_cat pipe adds up the values from the entities found in each sentence and assigns values to the categories based on this output. One can then process an entire document and identify where salient themes appear by chunking the document into groups of n sentences.
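To make the aggregation concrete, here is a minimal sketch in plain Python. It is not the gliner_cat code itself: the function names `score_sentence` and `score_chunks` are hypothetical, entities are assumed to arrive as (label, score) pairs, and summing the scores is one reading of "adds up the values". Note that a label such as "office" can belong to more than one category, so it contributes to each.

```python
from collections import defaultdict

# Abbreviated version of the category map above (illustrative only).
CATEGORIES = {
    "family": ["child", "spouse", "marital"],
    "labor": ["work", "job", "office"],
    "locational": ["address", "office"],
}


def score_sentence(entities, categories):
    """Sum entity scores per category for one sentence.

    entities: list of (label, score) pairs (assumed input shape).
    """
    # Invert the map so each label points at every category that lists it.
    label_to_cats = defaultdict(list)
    for cat, labels in categories.items():
        for label in labels:
            label_to_cats[label].append(cat)

    totals = {cat: 0.0 for cat in categories}
    for label, score in entities:
        for cat in label_to_cats.get(label, []):
            totals[cat] += score
    return totals


def score_chunks(sentence_entities, categories, n):
    """Group consecutive sentences into chunks of n and score each chunk."""
    chunks = []
    for start in range(0, len(sentence_entities), n):
        # Flatten the entities of the sentences in this chunk.
        merged = [
            ent
            for sentence in sentence_entities[start:start + n]
            for ent in sentence
        ]
        chunks.append(score_sentence(merged, categories))
    return chunks
```

Calling `score_chunks` with one entity list per sentence returns one dictionary of category totals per chunk, which is the kind of series one could plot across a document.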

The component will generate the data and visualization for this.


This works rather like zero-shot text classification with a slight difference. It lets a user define a controlled NER vocabulary that aligns to a topic. This means that when a user wants to understand why certain categories appeared in the text, not only do they know which sentences have those topics, they can point to the specific entities in the sentence that generated that output.
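The traceability described here can be sketched as follows. This is an illustration under assumptions, not the component's API: `explain` is a hypothetical helper, and each entity is assumed to be a dict with "text", "label", and "score" keys.

```python
def explain(entities, categories):
    """Map each category to the entity spans that contributed to it.

    entities: list of dicts with "text", "label", "score" (assumed shape).
    categories: dict mapping a category name to a list of labels.
    """
    evidence = {cat: [] for cat in categories}
    for ent in entities:
        for cat, labels in categories.items():
            if ent["label"] in labels:
                evidence[cat].append((ent["text"], ent["label"], ent["score"]))
    return evidence
```

For a sentence flagged as "family", the "family" entry in the returned dictionary lists the exact spans and labels behind that score.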

@urchade and @tomaarsen, if you like this, would you like to see it as part of GLiNER spaCy or as a separate installable spaCy component? It does not have any other requirements except for seaborn for the viz.

urchade commented 1 month ago

Hi @wjbmattingly, it looks very interesting. Do you have a demo so that I can try it?

wjbmattingly commented 1 month ago

@urchade you got it! I just pushed it to GitHub: https://github.com/theirstory/gliner-spacy/blob/main/examples/gliner_cat/gliner_cat_demo.ipynb You need to clone the repo and then run:

python -m pip install .

urchade commented 1 month ago

Thanks, I will try it.

I was also thinking about training a GLiNER model for zero-shot classification by framing the task as span extraction.

wjbmattingly commented 1 month ago

I tried that, but had a hard time getting it to work consistently on long spans. Maybe you will have better luck.

urchade commented 1 month ago

With some fine-tuning it should work

btw, Ihor (@Ingvarstep) has made a multi-task version of GLiNER: https://huggingface.co/knowledgator/gliner-multitask-large-v0.5