huggingface / disaggregators

🤗 Disaggregators: Curated data labelers for in-depth analysis.
Apache License 2.0
66 stars 3 forks source link

Compatibility with non-Hugging Face libraries #10

Open NimaBoscarino opened 1 year ago

NimaBoscarino commented 1 year ago

(As suggested in #8)

This issue may be split into multiple issues if needed.

Ideally the API should support these kinds of things out of the box, so this is a matter of verifying that they work and then documenting them, or making small changes to be compatible if needed. If it's absolutely necessary, we can consider making special methods to bridge things.

davidberenstein1957 commented 1 year ago

I just made a proposal for the new API structure. For now, it is structured as a standalone that processes documents using spaCy and lateron infers knowledge based on the pre-processed docs. However, I can also see it being set-up as a fully integrated spaCy pipeline component but this would require rewriting the code a bit more.

Note that the current set-up only works for Gender, however, we need to iterative over it a bit.

from disaggregators import Disaggregator

text = ["She, the woman, went to the park."]

disaggregator = Disaggregator("gender", language="en_core_web_md")

doc = disaggregator(text[0])
print(doc.spans["sc"])
print(doc.cats)
docs = disaggregator.pipe(text)

new features:

things to consider:

NimaBoscarino commented 1 year ago

Awesome, thank you so much for this! Jotting a couple thoughts down here:

I have some more thoughts which I'll write up soon – thanks again for doing this!!! 🤗

davidberenstein1957 commented 1 year ago

Hi, thanks for the input! I will re-write the code a bit during this week.

Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.

I assumed that we also wanted to optimize for speed, and reducing the dependency overhead so I opted for using spacy as a default tokenizer and pre-processor (not having to do this for each document for each potential module). For now, I will assume that flexibility and adaptability go over speed and efficiency.

The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.

I wanted to re-use this language param whenever possible while still allowing for using custom language configs per module.

What does sc stand for?

This is a spacy specific thing for handling overlapping spans, which enables direct visualization with display.

To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of : True/False. So for additional contextual info (e.g doc.spans["sc"]) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.

I agree with this. However, I limited the generalizability due to wanting to wrap the modules within the spaCy eco-system, but as mentioned I will re-factor the code to allow for more flexibility.