Open NimaBoscarino opened 1 year ago
I just made a proposal for the new API structure. For now, it is structured as a standalone
that processes documents using spaCy and lateron infers knowledge based on the pre-processed docs. However, I can also see it being set-up as a fully integrated spaCy pipeline component but this would require rewriting the code a bit more.
Note that the current set-up only works for Gender
, however, we need to iterative over it a bit.
from disaggregators import Disaggregator
text = ["She, the woman, went to the park."]
disaggregator = Disaggregator("gender", language="en_core_web_md")
doc = disaggregator(text[0])
print(doc.spans["sc"])
print(doc.cats)
docs = disaggregator.pipe(text)
new features:
pipe
for batch processingcall
and pipe
are only called once and processed (tokenized etc.) docs are shared.call
and pipe
do not need to be shared if components are set individually (we need to define something to merge individual docs per module)doc.spans["sc"]
and multi-label classifications doc.cats
things to consider:
n_i/sum(nij)
TODO
s in codeAwesome, thank you so much for this! Jotting a couple thoughts down here:
DisaggregationModule
s will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule
that subclasses DisaggregationModule
, to encapsulate the spaCy-specific API, and that could make the Gender
disaggregator simpler.language
param should probably live inside of the DisaggregationModuleConfig
(or even something like SpacyDisaggregationModuleConfig
), and should ideally not be passed through the DisaggregationModuleFactory
's create_module
, since it wouldn't be relevant for non-spaCy disaggregators.sc
stand for?<DisaggregationModuleLabel.VALUE>: True/False
. So for additional contextual info (e.g doc.spans["sc"]
) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModule
s can return a Tuple with something like: ResultDict, ResultContext
.I have some more thoughts which I'll write up soon – thanks again for doing this!!! 🤗
Hi, thanks for the input! I will re-write the code a bit during this week.
Since not all DisaggregationModules will be built with spaCy, I think it would make sense to make something like a SpacyDisaggregationModule that subclasses DisaggregationModule, to encapsulate the spaCy-specific API, and that could make the Gender disaggregator simpler.
I assumed that we also wanted to optimize for speed, and reducing the dependency overhead so I opted for using spacy as a default tokenizer and pre-processor (not having to do this for each document for each potential module). For now, I will assume that flexibility and adaptability go over speed and efficiency.
The language param should probably live inside of the DisaggregationModuleConfig (or even something like SpacyDisaggregationModuleConfig), and should ideally not be passed through the DisaggregationModuleFactory's create_module, since it wouldn't be relevant for non-spaCy disaggregators.
I wanted to re-use this language param whenever possible while still allowing for using custom language configs per module.
What does sc stand for?
This is a spacy specific thing for handling overlapping spans, which enables direct visualization with display.
To keep the output for each disaggregator simple + easy to work with, IMO all modules should return something in the form of a dict with keys of
: True/False. So for additional contextual info (e.g doc.spans["sc"]) I think it would be good to have that as a secondary return value, e.g. maybe SpacyDisaggregationModules can return a Tuple with something like: ResultDict, ResultContext.
I agree with this. However, I limited the generalizability due to wanting to wrap the modules within the spaCy eco-system, but as mentioned I will re-factor the code to allow for more flexibility.
(As suggested in #8)
This issue may be split into multiple issues if needed.
Ideally the API should support these kinds of things out of the box, so this is a matter of verifying that they work and then documenting them, or making small changes to be compatible if needed. If it's absolutely necessary, we can consider making special methods to bridge things.