quanteda / spacyr

R wrapper to spaCy NLP
http://spacyr.quanteda.io

Expand options available through posTag.py #1

Closed kbenoit closed 7 years ago

kbenoit commented 8 years ago

Figure out what other options are available in spaCy and build in options to call them.

  1. In the Python code
  2. In the R wrapper function tag()
kbenoit commented 8 years ago

These would include:

amatsuo commented 8 years ago

I have made significant changes to the dev_rpython branch and coded several functions to implement the points above.

In the new implementation, initialize_spacy_rpython() creates a new object of the spacyr() class in Python, and all work is supposed to be done in that object. In the new version, texts are handed to Python through parse_spacy() in R, and this function call conducts all parsing, including named entity recognition and dependency parsing. parse_spacy() returns an S3 object (class "spacy_out") that points to the location of the parsed documents in Python. This object is later used to get tokens, tokenization with tagging, and so on. All functions implemented today, except for parse_spacy(), require a "spacy_out" object.
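To make the workflow concrete, here is a minimal sketch of how I expect it to be called from R, using the function and class names described above (the texts and arguments are just placeholders):

```r
library(spacyr)

# start a spacyr instance on the Python side
initialize_spacy_rpython()

txts <- c(doc1 = "spaCy is a library for natural language processing.",
          doc2 = "Ken visited London last week.")

# parse once; returns a "spacy_out" S3 object that points to the
# parsed documents held on the Python side
sp_out <- parse_spacy(txts)

# the downstream extractors all take the "spacy_out" object
toks   <- get_tokens(sp_out)            # simple tokenization
tagged <- tag_new(sp_out)               # tokens plus POS tags
ents   <- get_entities(sp_out)          # per-token named-entity labels
deps   <- get_dependency_data(sp_out)   # dependency parse as a data frame
```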

Named entity recognition There are two functions at the moment: all_entities() returns a list of all named entities, and get_entities() returns the result of entity recognition for each token. Entries of the returned object look like ORG_B, where ORG is the type of named entity and B indicates the beginning of the entity (the other value is I, which indicates "inside the entity").
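For example, for a document containing "Ken visited New York City.", the per-token output of get_entities() would look roughly like this (a hypothetical illustration of the coding only; how non-entity tokens are represented is part of the open question in point 2 below):

```r
# hypothetical illustration of the B/I entity coding described above
c(Ken = "PERSON_B", visited = "", New = "GPE_B", York = "GPE_I",
  City = "GPE_I", "." = "")
```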

Simple tokenization get_tokens()

Tokenization with tagging tag_new() works much like the original tag() function. Internally, tag_new() calls two functions, get_tokens() and get_tags().

Dependency parsing get_dependency_data(). This function returns a data frame of dependency data. I could not find a spaCy token attribute corresponding to FEATS, so that field is omitted.
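As a rough illustration of the shape of the output, a parsed "Ken visited London." would give a data frame along these lines (the column names here are placeholders, not the final API):

```r
# illustrative shape of the dependency output; column names are placeholders
data.frame(
  docname  = "doc2",
  token_id = 0:3,                            # spaCy indexes tokens from zero
  token    = c("Ken", "visited", "London", "."),
  head_id  = c(1L, 1L, 1L, 1L),              # index of the syntactic head
  dep_rel  = c("nsubj", "ROOT", "dobj", "punct")
)
```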

Things to consider

  1. I've also tried another way to set documents in spaCy, where only tokenization is implemented and any additional processing is done separately. This worked fine and was much quicker when the purpose is just tokenization, but the result of named entity recognition was slightly different (and seemed less accurate) when spaCy's .entity() was called separately after tokenization. However, the current parse_spacy() is obviously overkill if the goal is only tokenization (and tagging). I will probably add a just_tokenization option later. My plan is that if the documents are processed with the just_tokenization flag, the additional functionality (e.g. tagging, entity recognition, and dependency parsing) will be run only when the relevant functions are called (see the sketch after this list).
  2. I am not sure what the best way is to present the output of the *_entities() functions.
  3. Dependency parsing output. The way I construct the data frame might be inefficient; please have a look.
  4. Currently the dev_rpython branch passes all checks. You can merge it into the master branch.
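To illustrate the just_tokenization idea in point 1, the intended usage would be roughly as follows (the option name and the deferred behaviour are tentative and not yet implemented; this reuses the txts object from the sketch above):

```r
# tentative sketch of the lazy-processing idea; not implemented yet
sp_out <- parse_spacy(txts, just_tokenization = TRUE)  # fast: tokenize only

toks <- get_tokens(sp_out)            # available immediately
ents <- get_entities(sp_out)          # would run entity recognition on demand
deps <- get_dependency_data(sp_out)   # would run the parser on demand
```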
kbenoit commented 8 years ago

Brilliant! I look forward to testing it out.

I will probably change some function names, but we can discuss first. Thanks Aki!!

kbenoit commented 8 years ago
  1. I am not sure what the best way is to present the output of the *_entities() functions.

How about we start all functions with:

This way, the package can extend the quanteda tokens() constructor, which is currently defined for corpus and character class inputs. It would also be consistent with the new quanteda API, since the tokens_* functions take a tokens object as input and return a tokens object. Here we would have extended the constructor and then defined additional methods to return an augmented tokens object.

In the current GitHub version of quanteda, there are functions such as as.list.tokens() that return the sort of list-of-characters tokens provided by your get_tokens().
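For reference, the existing quanteda pattern that this would extend looks like this (this is quanteda's own API, not spacyr):

```r
library(quanteda)

# tokens() constructs a tokens object from character or corpus input;
# as.list() (i.e. as.list.tokens) returns the plain list-of-characters form
toks <- tokens(c(d1 = "spaCy is a natural language processing library."))
as.list(toks)
## $d1
## [1] "spaCy" "is" "a" "natural" "language" "processing" "library" "."
```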

I am assuming that the pointer to the parsed spaCy object is not permanent, so the functions would need to check the state of the pointer and, if it is not current, refuse the command.
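For example, every extractor could begin with a guard along these lines (check_spacy_out() is a hypothetical helper, just to illustrate the idea):

```r
# hypothetical guard, illustrating the pointer check described above
check_spacy_out <- function(x) {
  if (!inherits(x, "spacy_out")) {
    stop("argument must be a spacy_out object")
  }
  # here we would ask the Python side whether the parsed documents
  # referenced by x are still available, and refuse the command if not
  invisible(x)
}
```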

  1. Dependency parsing output. The way I construct the data frame might be inefficient; please have a look.

Yes, but it's perfect (except for indexing from zero - not R-like!). On efficiency, I can fix that in the more efficient, indexed (hashed) version of tokens we re-designed for quanteda. We can also write an extractor function to produce your data.frame, if that is what people want. How we store it and what we extract don't have to be the same.
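(If we do keep the data.frame extractor, switching to R's one-based indexing is trivial; the column names here follow the illustrative sketch earlier in the thread:)

```r
# shift spaCy's zero-based token indices to R's one-based convention
deps$token_id <- deps$token_id + 1L
deps$head_id  <- deps$head_id  + 1L
```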

  1. Currently the dev_rpython branch passes all checks. You can merge it into the master branch.

Let's prune some functions first, settle on the names after some discussion, and then we can merge.