neomatrix369 / nlp_profiler

A simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and the name of a column containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.

Add phrase counts or parts-of-speech token counts after extracting entities from a sentence #15

Open neomatrix369 opened 4 years ago

neomatrix369 commented 4 years ago

Following on from PR #13, it appears there are other types of phrases to count, e.g. pronouns, dates, organisations, etc. The details can be discussed. So far we have achieved the following, and a number of others remain to be covered:

Named entity recognition features:

Parts of speech features:

See https://spacy.io/api/annotation#section-named-entities and http://www.nltk.org/book/ for details on the above items.

We will replace one or more existing functionalities in the library with the above, on a case-by-case basis. It would be best to group each set of features under a unique name, such as name-entity-recognition-features and parts-of-speech-features respectively, and club them with the granular features.

Both NLTK and spaCy would be used to fulfil these functionalities.
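
To make the proposal concrete, here is a minimal sketch of how per-label counts could be turned into grouped feature columns. It is only an illustration, not the library's implementation: the function names `count_labels` and `spacy_pairs` and the `ner_`/`pos_` column prefixes are hypothetical. The spaCy helper relies on `doc.ents`, `ent.label_` and `token.pos_`, and assumes a pipeline has already been loaded (for example with `spacy.load("en_core_web_sm")`); it is not called in the example, which uses hand-labelled pairs instead.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (text, label), e.g. ("London", "GPE")


def count_labels(pairs: Iterable[Pair], prefix: str) -> Dict[str, int]:
    """Count how often each label occurs and return one feature per label.

    The feature naming scheme (<prefix>_<label>_count) is hypothetical.
    """
    counts = Counter(label for _, label in pairs)
    return {
        f"{prefix}_{label.lower()}_count": n
        for label, n in sorted(counts.items())
    }


def spacy_pairs(text: str, nlp) -> Tuple[List[Pair], List[Pair]]:
    """Extract (text, label) pairs for entities and POS tags.

    Assumes `nlp` is an already loaded spaCy pipeline.
    """
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    pos_tags = [(tok.text, tok.pos_) for tok in doc]
    return entities, pos_tags


# Hand-labelled pairs standing in for real tagger output.
entities = [("Apple", "ORG"), ("London", "GPE"), ("Monday", "DATE"), ("Google", "ORG")]
pos_tags = [("She", "PRON"), ("visited", "VERB"), ("London", "PROPN")]

ner_features = count_labels(entities, "ner")
pos_features = count_labels(pos_tags, "pos")
```

Keeping the counting step separate from the tagging step means the same grouping logic could be fed from either spaCy or NLTK output, as long as it is reduced to (text, label) pairs first.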