medianeuroscience / emfdscore

Fast, flexible extraction of moral information from textual input data.
GNU General Public License v3.0

Passing text as input instead of csv #12

Open nakinnubis opened 2 years ago

nakinnubis commented 2 years ago

Is it possible to pass a text string directly as input rather than a CSV file?

g-simmons commented 2 years ago

@nakinnubis The authors haven't documented this functionality yet, but it can be done.

The naive approach would be to make a DataFrame from your text string, call score_docs() on that DataFrame, and then extract the information you need from the DataFrame that score_docs() returns. This is slow, especially if you have lots of text strings, mainly because it re-instantiates the spaCy pipeline for each string.

Instead, I would recommend the following. Create the spaCy pipeline (the nlp variable) once, and reuse it for as many text strings as you have.

import spacy
from emfdscore.scoring import score_emfd_single_sent

text1 = "I care a lot about not hurting people or causing any bodily harm."
text2 = "It is important to respect authority figures like the government and police."

texts = [text1, text2]

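# Build the pipeline once; this is the expensive step that the naive approach repeats.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
nlp.add_pipe("mfd_tokenizer")
nlp.add_pipe("score_emfd_single_sent", last=True)

# single string
result = nlp(text1)

# multiple strings: nlp.pipe() processes them as a stream, in input order
results = list(nlp.pipe(texts))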

In my experience, on about 8,000 strings of a few sentences each, the naive method would have taken over an hour, while this method takes less than one minute.
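
As a rough illustration of why reuse helps, here is a minimal sketch that uses only the standard library. Everything in it is a stand-in rather than part of emfdscore or spaCy: `build_pipeline` is a hypothetical placeholder for the expensive `spacy.load(...)` plus `add_pipe(...)` setup, and the "scorer" it returns just counts words. The only point is to show how many times the pipeline gets built under each pattern.

```python
from functools import lru_cache

BUILD_COUNT = 0  # how many times the (stand-in) pipeline was constructed


def build_pipeline():
    """Hypothetical stand-in for the expensive spacy.load(...) + add_pipe(...) setup."""
    global BUILD_COUNT
    BUILD_COUNT += 1
    # Placeholder scorer: counts words. The real pipeline would return eMFD scores.
    return lambda text: {"n_tokens": len(text.split())}


def score_naive(texts):
    """Rebuild the pipeline for every string (the slow pattern)."""
    return [build_pipeline()(t) for t in texts]


@lru_cache(maxsize=1)
def get_pipeline():
    """Build the pipeline on first use and return the same object afterwards."""
    return build_pipeline()


def score_reusing(texts):
    """Reuse one cached pipeline for all strings (the recommended pattern)."""
    nlp = get_pipeline()
    return [nlp(t) for t in texts]


texts = ["I care about not hurting people.", "Respect authority figures."]

score_naive(texts)
naive_builds = BUILD_COUNT   # one build per string

BUILD_COUNT = 0
score_reusing(texts)
score_reusing(texts)         # a second call does not rebuild the pipeline
reused_builds = BUILD_COUNT
```

With the naive pattern the build count grows with the number of strings, while the cached version builds once no matter how many strings or calls follow. Wrapping the loader in `lru_cache` is one convenient way to get that behaviour if the scoring is called from several places; a module-level `nlp` variable, as in the snippet above, works just as well.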