Compatibility with non-Hugging Face libraries

NimaBoscarino commented 1 year ago

(As suggested in #8)

This issue may be split into multiple issues if needed.

Pandas
SpaCy
etc.

Ideally the API should support these kinds of things out of the box, so this is a matter of verifying that they work and then documenting them, or making small changes to be compatible if needed. If it's absolutely necessary, we can consider making special methods to bridge things.

davidberenstein1957 commented 1 year ago

I just made a proposal for the new API structure. For now, it is structured as a standalone that processes documents using spaCy and lateron infers knowledge based on the pre-processed docs. However, I can also see it being set-up as a fully integrated spaCy pipeline component but this would require rewriting the code a bit more.

Note that the current set-up only works for Gender, however, we need to iterative over it a bit.

from disaggregators import Disaggregator

text = ["She, the woman, went to the park."]

disaggregator = Disaggregator("gender", language="en_core_web_md")

doc = disaggregator(text[0])
print(doc.spans["sc"])
print(doc.cats)
docs = disaggregator.pipe(text)

new features:

pipe for batch processing
call and pipe are only called once and processed (tokenized etc.) docs are shared.
call and pipe do not need to be shared if components are set individually (we need to define something to merge individual docs per module)
predictions are split in recognized potentially overlapping Spans doc.spans["sc"] and multi-label classifications doc.cats

things to consider:

instead of docs we can also just return list of spans and classes
semi-standalone vs. spaCy components per module
word-matched vs fuzzymatches vs pattern matches vs semantic matches
expand word sets over word2vec
weighted counter for classification over number of hits n_i/sum(nij)
TODOs in code

NimaBoscarino commented 1 year ago

Awesome, thank you so much for this! Jotting a couple thoughts down here:

Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.
The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.
What does sc stand for?
To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of <DisaggregationModuleLabel.VALUE>: True/False. So for additional contextual info (e.g doc.spans["sc"]) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.

I have some more thoughts which I'll write up soon – thanks again for doing this!!! 🤗

davidberenstein1957 commented 1 year ago

Hi, thanks for the input! I will re-write the code a bit during this week.

Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.

I assumed that we also wanted to optimize for speed, and reducing the dependency overhead so I opted for using spacy as a default tokenizer and pre-processor (not having to do this for each document for each potential module). For now, I will assume that flexibility and adaptability go over speed and efficiency.

The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.

I wanted to re-use this language param whenever possible while still allowing for using custom language configs per module.

What does sc stand for?

This is a spacy specific thing for handling overlapping spans, which enables direct visualization with display.

To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of : True/False. So for additional contextual info (e.g doc.spans["sc"]) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.

I agree with this. However, I limited the generalizability due to wanting to wrap the modules within the spaCy eco-system, but as mentioned I will re-factor the code to allow for more flexibility.

huggingface / disaggregators

Compatibility with non-Hugging Face libraries #10