Can negspacy be used with already identified Entities from scispacy

jenojp / negspacy

spaCy pipeline object for negating concepts in text

MIT License

274 stars, 36 forks

Can negspacy be used with already identified Entities from scispacy #16

Closed toltoxgh closed 3 years ago

toltoxgh commented 4 years ago

Is your feature request related to a problem? Please describe.

Can negspacy be used with entities and spans that scispacy has already identified, by providing them to it somehow?

Describe the solution you'd like

For instance, scispacy has already been run with its EntityLinker, and the UMLS entities and their indices have been obtained and stored somewhere.

It would be computationally expensive to run the whole scispacy pipeline again together with negspacy. Is there a way to run (sci)spacy with only base spaCy functionality, such as the tokenizer, and supply the full text string, the entities, and their indices, so that negspacy can determine the negation status?

jenojp commented 4 years ago

Interesting question... how do you have the processed docs w/ entities and indices stored?

Raghu17s commented 4 years ago

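Check out this:

```python
import en_core_sci_md
from negspacy.negation import Negex

# spaCy v2-style API: the component object is passed to add_pipe directly.
nlp = en_core_sci_md.load()
negex = Negex(nlp, language="en_clinical", chunk_prefix=["no"])
nlp.add_pipe(negex)

doc = nlp("your text")
for ent in doc.ents:
    # ent._.negex is True when the entity is negated
    print(ent, ent.label_, ent._.negex)
```

Now filter your entities by whether `ent._.negex` is True or False. It worked for my problem. But it would be difficult for already extracted entities, as you have lost the sentence context.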

toltoxgh commented 4 years ago

The stored data could be, for example, the texts themselves plus the start and end indices of the entities in a CSV-like file.

Independent of how this is stored, once this info is read and parsed, would there be an option to run negspacy with only this information and a basic spaCy tokenizer, sentence splitter, etc., without having to run the whole of scispacy again?
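To make the storage format concrete, here is a minimal sketch of reading such a cache back in. The column names `text`, `start`, and `end`, the exclusive end offset, and the helper name `load_entity_cache` are assumptions for illustration only; nothing in this thread specifies the actual layout.

```python
import csv
import io
from collections import defaultdict


def load_entity_cache(csv_text):
    """Group cached (start, end) character offsets by their source text."""
    offsets_by_text = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        offsets_by_text[row["text"]].append((int(row["start"]), int(row["end"])))
    return dict(offsets_by_text)


# Hypothetical cache: one row per entity, end offset exclusive.
cache = (
    "text,start,end\n"
    '"Patient denies chest pain but reports nausea.",15,25\n'
    '"Patient denies chest pain but reports nausea.",38,44\n'
)

for text, offsets in load_entity_cache(cache).items():
    # Slicing the text with the cached offsets recovers each entity string.
    print([text[start:end] for start, end in offsets])
```

Grouping by text means each document only needs to be tokenized once, however many entities were cached for it.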

jenojp commented 4 years ago

I'll leave this open in case anyone has ideas on how to build a spaCy Doc manually from this kind of data cache and then run only parts of a pipeline on it. I did some poking around and couldn't see an obvious way forward.

With the caveat of not knowing your use case entirely, I'd venture to guess that it might end up being more work than it's worth to get working the way you want instead of just rerunning and taking the computational hit.
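One possible direction for the idea above, offered as an untested sketch rather than a confirmed solution: tokenize with a blank pipeline, add rule-based sentence boundaries, attach the cached character offsets as entities, and then call the negation component directly. It assumes spaCy v3 and negspacy 1.x (a newer API than the snippet earlier in this thread uses) and assumes the default `negex` configuration applies the `en_clinical` termset. The function names and the `ENTITY` label are hypothetical.

```python
def keep_valid_offsets(text, offsets):
    """Drop cached offsets that are empty or fall outside the text."""
    return [(s, e) for s, e in offsets if 0 <= s < e <= len(text)]


def negation_for_cached_entities(text, offsets, label="ENTITY"):
    """Return (entity_text, is_negated) pairs for cached character offsets.

    Assumption: spaCy v3 and negspacy 1.x are installed. They are imported
    lazily so the helper above can be used without them.
    """
    import spacy
    from spacy.util import filter_spans
    from negspacy.negation import Negex  # noqa: F401 (registers the "negex" factory)

    nlp = spacy.blank("en")        # tokenizer only, no statistical model
    nlp.add_pipe("sentencizer")    # rule-based sentence boundaries
    negex = nlp.add_pipe("negex")  # default config (assumed en_clinical termset)

    # Tokenize and split sentences first; the entities are not attached yet.
    with nlp.select_pipes(disable=["negex"]):
        doc = nlp(text)

    # Map character offsets to token spans; "expand" snaps to token boundaries.
    spans = [
        doc.char_span(s, e, label=label, alignment_mode="expand")
        for s, e in keep_valid_offsets(text, offsets)
    ]
    # doc.set_ents rejects overlapping spans, so filter them beforehand.
    doc.set_ents(filter_spans([span for span in spans if span is not None]))

    doc = negex(doc)               # run negation on the injected entities
    return [(ent.text, ent._.negex) for ent in doc.ents]
```

Whether the rule-based sentence boundaries and the expanded spans match what scispacy originally produced is not guaranteed, so results may differ from a full rerun.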

jenojp commented 3 years ago

Closing due to lack of activity.