shon-otmazgin / fastcoref

MIT License

Edge case with multiple custom components. #13

Closed · radandreicristian closed this issue 1 year ago

radandreicristian commented 1 year ago

Hi!

I have an edge case that involves a language pipeline with fastcoref and other custom components.

In my case, I have en_core_web_md, fastcoref and custom_component. Let's say custom_component depends on the parser.

In fastcoref/utilities/util.py, line 116 is expected (I think) to disable all the components except the tokenizer, which is then used to tokenize the text.

However, other custom components are not taken into account, so you end up with a pipeline that still contains them alongside the tokenizer.

For example, the pipeline becomes something like ["tok2vec", "custom_component"]. Since custom_component depends on the parser, the call on line 116 crashes when the custom component runs, because the component it depends on has been disabled.

I think a quick solution would be to inspect the pipe for custom components and disable them.
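Concretely, something along these lines. This is only a sketch, not fastcoref's actual code; custom_component and the model name are placeholders. The idea is to derive the disable list from whatever is currently in the pipeline instead of hard-coding it, so custom components get disabled too:

import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("custom_component")
def custom_component(doc: Doc) -> Doc:
    # Stand-in for a component that needs the dependency parse.
    if not doc.has_annotation("DEP"):
        raise ValueError("custom_component needs the parser to have run")
    return doc


nlp = spacy.load("en_core_web_md")  # assumes the model is installed
nlp.add_pipe("custom_component", last=True)

texts = ["Alice met Bob. She greeted him."]

# Hard-coded disable list: custom_component stays enabled and raises,
# because the parser it depends on is disabled for this call.
# docs = list(nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer", "ner", "textcat"]))

# Disabling everything currently in the pipeline leaves only the tokenizer running.
docs = list(nlp.pipe(texts, disable=nlp.pipe_names))
print([token.text for token in docs[0]])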

If you are busy, I can make a PR for this sometime in the next few days.

Thanks a lot!

shon-otmazgin commented 1 year ago

Hi @radandreicristian

I am not sure I understood. My intention is: the coref package needs a spaCy component to tokenize the raw text, regardless of whether the user uses the package's spaCy integration. So, instead of constructing a new spaCy object inside the coref class, if you are already using spaCy we can reuse it and inject the nlp object into the coref class. Then, inside the coref class, we disable all components except the tokenizer in a particular pipe call. Does that disable them for future calls as well?

This is the injection in the spaCy component:

self.coref_model = FCoref(model_name_or_path=model_path, device=device, nlp=nlp)

Inside the coref class, we check whether an nlp object was injected or not:

if nlp is not None:
    self.nlp = nlp
else:
    try:
        self.nlp = spacy.load("en_core_web_sm", exclude=["tagger", "parser", "lemmatizer", "ner", "textcat"])
    except OSError:
        # handling for a missing en_core_web_sm model (the quoted snippet is truncated here)
        ...

Then I just disable them in the pipe call:

docs = nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer", "ner", "textcat", "fastcoref"])
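For clarity, the disable argument here only applies to that specific pipe call; the pipeline itself is left as configured for later calls. A small sketch (the component names are just whatever the loaded model happens to have):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the model is installed
print(nlp.pipe_names)               # e.g. ['tok2vec', 'tagger', 'parser', ...]

texts = ["Alice met Bob. She greeted him."]

# These components are skipped for this call only.
docs = list(nlp.pipe(texts, disable=["tagger", "parser", "lemmatizer", "ner"]))
print(docs[0].has_annotation("DEP"))  # False: the parser did not run here

# A later call without `disable` runs the full pipeline again.
doc = nlp(texts[0])
print(doc.has_annotation("DEP"))      # True
print(nlp.pipe_names)                 # unchanged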

radandreicristian commented 1 year ago

I understand.

What I am saying is that when the pipeline is injected into the coref component, you do not account for the possibility of other custom components, which are not disabled when you tokenize with the injected nlp.

I opened a pull request for that; maybe it makes more sense when you see it in the code.

shon-otmazgin commented 1 year ago

Got it!

Very elegant solution, thanks for your contribution. I will merge it and release a new PyPI version soon (~1 hour).

radandreicristian commented 1 year ago

Thanks for the fast response and update @shon-otmazgin , much appreciated!

I'll close this issue now as it's been fixed in version 2.0.3.

Cheers!

shon-otmazgin commented 1 year ago

DONE.

pip install --upgrade fastcoref or pip install fastcoref==2.0.3

Thank you again!
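For anyone finding this later, a minimal sketch of the scenario from this issue after upgrading, based on the spaCy integration described in the fastcoref README (the custom component here is purely illustrative, and extension names may differ between versions):

import spacy
from spacy.language import Language
from fastcoref import spacy_component  # noqa: F401  (registers the "fastcoref" factory)


@Language.component("my_custom_component")
def my_custom_component(doc):
    # Illustrative component that reads the dependency parse.
    _ = [token.dep_ for token in doc]
    return doc


nlp = spacy.load("en_core_web_md")  # assumes the model is installed
nlp.add_pipe("fastcoref")
nlp.add_pipe("my_custom_component", last=True)

doc = nlp("Alice went to the store because she needed milk.")
print(doc._.coref_clusters)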