NLP Profiler is a simple NLP library for profiling datasets with one or more text columns. Given a dataset and the name of a column containing text data, it returns either high-level insights or low-level (granular) statistical information about the text in that column.
Add phrase counts or parts-of-speech token counts after extracting entities from a sentence #15
Following PR #13, it appears there are other types of phrases worth counting, e.g. pronouns, dates, or organisations; the details can be discussed. So far we have achieved some of these, and a number of others remain to be covered:
Named entity recognition features:
[ ] PERSON | People, including fictional.
[ ] NORP | Nationalities or religious or political groups.
[ ] FAC | Buildings, airports, highways, bridges, etc.
[ ] ORG | Companies, agencies, institutions, etc.
[ ] GPE | Countries, cities, states.
[ ] LOC | Non-GPE locations, mountain ranges, bodies of water.
[ ] PRODUCT | Objects, vehicles, foods, etc. (Not services.)
[ ] EVENT | Named hurricanes, battles, wars, sports events, etc.
[ ] WORK_OF_ART | Titles of books, songs, etc.
[ ] LAW | Named documents made into laws.
[ ] LANGUAGE | Any named language. (related to #4 feature request)
[ ] DATE | Absolute or relative dates or periods.
[ ] TIME | Times smaller than a day.
[ ] PERCENT | Percentage, including “%”.
[ ] MONEY | Monetary values, including unit.
[ ] QUANTITY | Measurements, as of weight or distance.
[ ] ORDINAL | “first”, “second”, etc.
[ ] CARDINAL | Numerals that do not fall under another type.
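As a rough illustration of what a per-label count for the checklist above might look like, here is a minimal sketch. The helper name `count_entity_labels` and the `NER_LABELS` tuple are hypothetical and not part of NLP Profiler; the function only assumes it receives an iterable of entity label strings, however they were extracted.

```python
from collections import Counter
from typing import Dict, Iterable

# Labels copied from the checklist above.
NER_LABELS = (
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
    "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
)


def count_entity_labels(labels: Iterable[str]) -> Dict[str, int]:
    """Return one count per label in NER_LABELS (hypothetical helper).

    Labels outside NER_LABELS are ignored; labels never seen get a count of 0,
    so every row of a dataset would end up with the same set of columns.
    """
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in NER_LABELS}


# With spaCy and a pipeline that includes an NER component, the labels
# could come from: labels = [ent.label_ for ent in nlp(text).ents]
# Hard-coded labels keep this sketch runnable without a model download.
example = count_entity_labels(["PERSON", "GPE", "GPE", "DATE"])
print(example["GPE"], example["PERSON"], example["ORG"])
```

Returning a fixed set of keys is one possible design choice: it would let each label map to its own granular feature column, even when a given text contains no entities of that type.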
Parts of speech features:
[X] (NOUN | noun | girl, cat, tree, air, beauty) Noun phrase count via #13 by @ritikjain51 and #47
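For comparison with the noun phrase count from #13, a parts-of-speech token count could be sketched as below. This counts individual noun tokens rather than noun phrases, so it is not the approach taken in #13. The helper name `count_noun_tokens` is hypothetical, and the sketch assumes the input is a list of `(token, tag)` pairs using Penn Treebank tags, which is the shape `nltk.pos_tag` returns.

```python
from typing import Iterable, Tuple


def count_noun_tokens(tagged_tokens: Iterable[Tuple[str, str]]) -> int:
    """Count tokens whose Penn Treebank tag starts with NN (NN, NNS, NNP, NNPS)."""
    return sum(1 for _, tag in tagged_tokens if tag.startswith("NN"))


# With NLTK, the pairs could come from:
#   tagged = nltk.pos_tag(nltk.word_tokenize(text))
# Hand-tagged pairs keep this sketch runnable without NLTK data downloads.
tagged = [("The", "DT"), ("girl", "NN"), ("saw", "VBD"), ("two", "CD"), ("cats", "NNS")]
noun_count = count_noun_tokens(tagged)
print(noun_count)
```

The same pattern would extend to other parts of speech by changing the tag prefix, for example `VB` for verbs.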
We will replace one or more existing functionalities in the library with the above, on a case-by-case basis. It would be best to group them and give the groups unique names such as name-entity-recognition-features and parts-of-speech-features, respectively, and club them with the granular features.
Both NLTK and spaCy would be used to implement these features.
See https://spacy.io/api/annotation#section-named-entities and http://www.nltk.org/book/ for details on the above items.