Open nakinnubis opened 2 years ago
@nakinnubis The authors haven't documented this functionality yet but it can be done.
The naive approach would be to make a DataFrame from your text string, call score_docs() on that DataFrame, and then extract the appropriate information from the DataFrame that score_docs() returns. This is slow, especially if you have lots of text strings, mainly because it re-instantiates the spaCy engine for each string.
Instead, I would recommend the following. Create the spaCy engine (the nlp variable) once, and reuse it for as many text strings as you have.
```python
import spacy
from emfdscore.scoring import score_emfd_single_sent

text1 = "I care a lot about not hurting people or causing any bodily harm."
text2 = "It is important to respect authority figures like the government and police."
texts = [text1, text2]

# Build the pipeline once; NER and the parser are disabled
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.add_pipe("mfd_tokenizer")
nlp.add_pipe("score_emfd_single_sent", last=True)

# single string
result = nlp(text1)

# multiple strings
results = list(nlp.pipe(texts))
```
In my experience, on about 8,000 strings of a few sentences each, the naive method would have taken over an hour, whereas this method takes less than one minute.
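To illustrate why reuse helps, here is a minimal sketch that uses only the standard library. It does not call spaCy or emfdscore at all: `load_engine`, `score_naive`, and `score_reused` are hypothetical names, and the "engine" is a toy word counter standing in for an expensive call such as `spacy.load()`. It only shows that the naive pattern pays the load cost once per string, while the reuse pattern pays it once in total.

```python
import time

LOAD_COUNT = 0  # how many times the (simulated) engine was built


def load_engine():
    """Hypothetical stand-in for an expensive call such as spacy.load()."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.01)  # pretend that loading is slow
    return lambda text: len(text.split())  # toy "scoring": count words


def score_naive(texts):
    # Rebuilds the engine for every string, like the naive approach.
    return [load_engine()(t) for t in texts]


def score_reused(texts):
    # Builds the engine once and reuses it for every string.
    engine = load_engine()
    return [engine(t) for t in texts]


texts = ["I care about not hurting people.", "Respect authority figures."] * 5

LOAD_COUNT = 0
naive_scores = score_naive(texts)
naive_loads = LOAD_COUNT

LOAD_COUNT = 0
reused_scores = score_reused(texts)
reused_loads = LOAD_COUNT

# Both patterns give the same scores, but the number of loads differs.
print(naive_loads, reused_loads)
```

With the ten strings above, the naive version builds the engine ten times and the reuse version builds it once, so the gap grows linearly with the number of strings.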
Is it possible to pass a text string directly as input rather than a CSV file?