snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Add support for other NLP preprocessors (CoreNLP, nltk) #1418

Open bhancock8 opened 5 years ago

bhancock8 commented 5 years ago

spaCy is great as a preprocessor for NLP labeling functions, but there are other libraries that individuals may want to use.

Ideally, we'd like to have wrappers for other packages as well, such as Stanford CoreNLP (https://stanfordnlp.github.io/stanfordnlp/) and NLTK (https://www.nltk.org/). We can pattern-match on the SpacyPreprocessor, then ultimately give the nlp_labeling_function decorator a keyword argument that lets the user specify which preprocessor to use.
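To make the "pattern-match on SpacyPreprocessor" idea concrete, here is a minimal sketch of what such a wrapper's shape could be. Everything here is illustrative: `NltkPreprocessor`, the `Token` record, and the regex tokenizer with its toy tagging rule are stand-ins (a real wrapper would call `nltk.word_tokenize` and `nltk.pos_tag`); only the `text_field`/`doc_field` constructor arguments are meant to mirror `SpacyPreprocessor`'s interface.

```python
import re
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Token:
    text: str
    pos_: str  # coarse POS tag, named to echo spaCy's token.pos_


class NltkPreprocessor:
    """Illustrative wrapper patterned on SpacyPreprocessor: reads
    x.<text_field> and attaches a token list at x.<doc_field>."""

    def __init__(self, text_field, doc_field):
        self.text_field = text_field
        self.doc_field = doc_field

    def __call__(self, x):
        text = getattr(x, self.text_field)
        # Stand-in for nltk.word_tokenize + nltk.pos_tag: tag words
        # ending in "y" as ADJ so the example is self-contained.
        tokens = [
            Token(w, "ADJ" if w.endswith("y") else "X")
            for w in re.findall(r"\w+", text)
        ]
        setattr(x, self.doc_field, tokens)
        return x


x = SimpleNamespace(text="a shiny happy person")
x = NltkPreprocessor(text_field="text", doc_field="doc")(x)
print([t.text for t in x.doc if t.pos_ == "ADJ"])  # → ['shiny', 'happy']
```

Keeping the constructor signature parallel to SpacyPreprocessor is what would later let a keyword argument on the decorator select among preprocessors.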

cyrilou242 commented 4 years ago

Hello, I may have a need for this in the future. I may get some time to contribute.

I am not sure what you mean by 'pattern matching' the SpacyPreprocessor: do we want to rebuild a spacy.Doc-like object, to ensure some compatibility at the tf (transformation function) definition level?

For instance, when defining an augmentation function like this for spaCy:

from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy_proc = SpacyPreprocessor(text_field="text", doc_field="doc")

@transformation_function(pre=[spacy_proc])
def swap_adjectives(x):
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    ...  # modify adjectives
    return x
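For what it's worth, the elided `# modify adjectives` step could look like the following. This is only a sketch: the swap-first-two policy and the `SimpleNamespace` tokens are illustrative, and all it assumes about the preprocessor's output is a sequence of tokens with spaCy-style `text`/`pos_` attributes.

```python
from types import SimpleNamespace


def swap_first_two_adjectives(tokens):
    """Return the token texts with the first two adjectives exchanged."""
    adjective_idxs = [i for i, t in enumerate(tokens) if t.pos_ == "ADJ"]
    out = [t.text for t in tokens]
    if len(adjective_idxs) >= 2:
        i, j = adjective_idxs[:2]
        out[i], out[j] = out[j], out[i]
    return out


# "the red big dog" -> "the big red dog"
tokens = [
    SimpleNamespace(text="the", pos_="DET"),
    SimpleNamespace(text="red", pos_="ADJ"),
    SimpleNamespace(text="big", pos_="ADJ"),
    SimpleNamespace(text="dog", pos_="NOUN"),
]
print(" ".join(swap_first_two_adjectives(tokens)))  # → the big red dog
```

Note that the body only touches `pos_` and `text`, which is exactly the compatibility question being raised: whether other preprocessors should expose the same attributes.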

Let's say we want to replace spacy_proc with an nltk_proc or a stanford_nlp_proc.

In any case, NLTK does not have a pipeline concept (correct me if I am wrong), so one would have to be specified.
StanfordNLP, by contrast, does have a pipeline, but it differs from spaCy's.

Maybe you have already thought about the design of this. I think going with option 2 (exposing each processor's native output) would be better, to avoid having to maintain 'adapters' to the spaCy Doc format, but switching between NLP processors will then be difficult. I guess you would go for option 2, but I am a bit biased towards option 1 (a common spacy.Doc-like interface), because I'm building a repo of text transformation functions, and being able to switch from one processor to another without breaking my tfs would be cool.

Let me know if there are any other important points to consider.

EDITED: I mistakenly switched option 1 and option 2 at the end of the last paragraph.

bhancock8 commented 4 years ago

Hi @cyrilou242, thanks for your post! I think you've outlined the tradeoffs well. Each preprocessor produces potentially different fields, and even fields with the same high-level "type" (such as NER tags) can differ between processors in cardinality and in how each tag is defined. So I think option 2 is the safer choice: in your function, you use the field names specific to the preprocessor you chose, and we assume you've done your homework to understand what each field means.
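A small sketch of what option 2 implies in practice: the same rule gets written once per preprocessor, against that preprocessor's own field names and tag set. The field names `doc` and `tagged` below are illustrative (faked with `SimpleNamespace` rather than real preprocessor output), but the tag mismatch is real: spaCy uses coarse UPOS tags like `ADJ`, while NLTK's default tagger emits Penn Treebank tags (`JJ`, `JJR`, `JJS`), so the function bodies cannot be shared.

```python
from types import SimpleNamespace


# spaCy-style output: tokens carry a .pos_ attribute with coarse UPOS tags.
def has_adjective_spacy(x):
    return any(t.pos_ == "ADJ" for t in x.doc)


# NLTK-style output: (word, tag) pairs with Penn Treebank tags.
def has_adjective_nltk(x):
    return any(tag.startswith("JJ") for _, tag in x.tagged)


x_spacy = SimpleNamespace(doc=[SimpleNamespace(text="red", pos_="ADJ")])
x_nltk = SimpleNamespace(tagged=[("red", "JJ")])
print(has_adjective_spacy(x_spacy), has_adjective_nltk(x_nltk))  # → True True
```

This is the "do your homework" cost of option 2, traded against not maintaining adapter code that maps every tag set onto a common one.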

If you're able to find the time to give this a shot, feel free to post intermediate thoughts here along the way so we can talk over any additional design considerations like the one you brought up; much easier to have these conversations early!

cyrilou242 commented 4 years ago

Thanks for the reply; I agree with option 2. I'll get back to you in a few weeks: I need to dig a bit more into StanfordNLP 0.2.0, which is quite new.