ropensci / ozunconf18

repository for the rOpenSci ozunconference 2018

simplified NLP and text mining functions #16

Open mrjoh3 opened 6 years ago

mrjoh3 commented 6 years ago

I am thinking of a series of Natural Language Processing functions that take care of the pre-processing and allow the user to focus on the task and the output. This would be most useful where the objective is to extract something from a vector of text.

Some common NLP or text-mining tasks to begin with could be entity extraction (people, places), keyword extraction, and perhaps even topic modelling or text classification.

The function names would all be self-descriptive, for example: extract_place(), extract_people(), extract_topics(), etc.

Inputs would be simple vectors of text, and outputs would be a vector or list of the same length as the input. So the functions could easily slot into a tidy workflow:

library(dplyr)  # provides %>% and mutate()

df %>%
  mutate(keywords = extract_keywords(text_column))

For me, the most complex part of NLP is the pre-processing. But I suspect (hope) it would be possible to set up a robust and generic process, and I think that for 90% of use cases generic pre-processing with only a few options would be sufficient.
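To make the idea concrete, here is a rough sketch in base R of what a generic pre-processing step and one extract_*() function might look like. Everything in it is an assumption for illustration: preprocess_text(), its lowercase and remove_stop_words options, the n argument, the tiny stop-word list, and the frequency-based ranking are all hypothetical, and none of it is taken from an existing package. The point is only the contract: a character vector goes in and a list of the same length comes out.

```r
# Hypothetical sketch only: these names and defaults are illustrative.

# A tiny stop-word list for the example; a real one would be much longer.
stop_words <- c("a", "an", "and", "the", "of", "to", "in", "is", "it",
                "for", "on", "with")

# Generic pre-processing with only a few options.
# Returns a list of token vectors, one element per input string.
preprocess_text <- function(text, lowercase = TRUE, remove_stop_words = TRUE) {
  text <- as.character(text)
  if (lowercase) text <- tolower(text)
  # Replace anything that is not a letter, digit or whitespace with a space
  text <- gsub("[^[:alnum:][:space:]]+", " ", text)
  tokens <- strsplit(trimws(text), "[[:space:]]+")
  lapply(tokens, function(tok) {
    # Drop missing values and empty strings
    tok <- tok[!is.na(tok) & nzchar(tok)]
    if (remove_stop_words) tok <- tok[!tolower(tok) %in% stop_words]
    tok
  })
}

# Naive keyword extraction: the n most frequent tokens per document,
# with ties broken alphabetically. Extra arguments go to preprocess_text().
extract_keywords <- function(text, n = 3, ...) {
  tokens <- preprocess_text(text, ...)
  lapply(tokens, function(tok) {
    if (length(tok) == 0) return(character(0))
    counts <- table(tok)
    ranked <- names(counts)[order(-as.integer(counts), names(counts))]
    head(ranked, n)
  })
}

docs <- c("The cat sat on the mat with the other cat.",
          "",
          "Dogs chase cats; cats chase mice.")
keywords <- extract_keywords(docs, n = 2)

# One result per input, including the empty document
length(keywords) == length(docs)
```

Because the result is a list with one element per input string, it could be stored as a list column inside mutate(). A real implementation would presumably swap the frequency count for something better (TF-IDF, RAKE, or similar) without changing the interface.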

The question I have is whether something like this already exists; I will check.

mrjoh3 commented 6 years ago

This project is underway and lives at https://github.com/ropenscilabs/simpletextr