quanteda / spacyr

R wrapper to spaCy NLP
http://spacyr.quanteda.io

Expand options available through posTag.py #1

Closed kbenoit closed 7 years ago

kbenoit commented 8 years ago

Figure out what other options are available in spaCy and build in options to call them.

  1. In the Python code
  2. In the R wrapper function tag()
kbenoit commented 8 years ago

These would include:

amatsuo commented 8 years ago

I have made significant changes to the dev_rpython branch and coded several functions to implement the points above.

In the new implementation, initialize_spacy_rpython() creates a new object of the spacyr() class in Python, and all work is supposed to be done in that object. In the new version, texts are handed to Python through parse_spacy() in R, and this function call conducts all parsing, including named entity recognition and dependency parsing. parse_spacy() returns an S3 object (class "spacy_out") that points to the location of the parsed documents in Python. This object is later used to get tokens, tokenization with tagging, and so on. All functions implemented today, except for parse_spacy(), require a "spacy_out" object.
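To make the workflow concrete, here is a minimal sketch of how I expect it to be called from R, using the function and class names described above (the texts and arguments are just placeholders):

```r
library(spacyr)

# start a spacyr instance on the Python side
initialize_spacy_rpython()

txts <- c(doc1 = "spaCy is a library for natural language processing.",
          doc2 = "Ken visited London last week.")

# parse once; returns a "spacy_out" S3 object that points to the
# parsed documents held on the Python side
sp_out <- parse_spacy(txts)

# the downstream extractors all take the "spacy_out" object
toks   <- get_tokens(sp_out)            # simple tokenization
tagged <- tag_new(sp_out)               # tokens plus POS tags
ents   <- get_entities(sp_out)          # per-token named-entity labels
deps   <- get_dependency_data(sp_out)   # dependency parse as a data frame
```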

Named entity recognition There are two functions at the moment: all_entities() returns a list of all named entities, and get_entities() returns the result of entity recognition for each token. Entries of the returned object look like ORG_B, where ORG is the type of named entity and B indicates the beginning of the entity (the other value is I, which indicates "inside the entity").
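For example, for a document containing "Ken visited New York City.", the per-token output of get_entities() would look roughly like this (a hypothetical illustration of the coding only; how non-entity tokens are represented is part of the open question in point 2 below):

```r
# hypothetical illustration of the B/I entity coding described above
c(Ken = "PERSON_B", visited = "", New = "GPE_B", York = "GPE_I",
  City = "GPE_I", "." = "")
```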

Simple tokenization get_tokens()

Tokenization with tagging tag_new() works much like the original tag() function. Internally, tag_new() calls two functions, get_tokens() and get_tags().

Dependency parsing get_dependency_data(). This function returns a data frame of dependency data. I could not find a spaCy token attribute corresponding to FEATS, so that field is omitted.
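As a rough illustration of the shape of the output, a parsed "Ken visited London." would give a data frame along these lines (the column names here are placeholders, not the final API):

```r
# illustrative shape of the dependency output; column names are placeholders
data.frame(
  docname  = "doc2",
  token_id = 0:3,                            # spaCy indexes tokens from zero
  token    = c("Ken", "visited", "London", "."),
  head_id  = c(1L, 1L, 1L, 1L),              # index of the syntactic head
  dep_rel  = c("nsubj", "ROOT", "dobj", "punct")
)
```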

Things to consider

  1. I've also tried another way to set documents in spaCy, where only tokenization is implemented and any additional processing is done separately. This worked fine and was much quicker when the purpose is just tokenization, but the result of named entity recognition was slightly different (and seemed less accurate) when spaCy's .entity() was called separately after tokenization. However, the current parse_spacy() is obviously overkill if the goal is only tokenization (and tagging). I will probably add a just_tokenization option later. My plan is that if the documents are processed with the just_tokenization flag, the additional functionality (e.g. tagging, entity recognition, and dependency parsing) will be run only when the relevant functions are called (see the sketch after this list).
  2. I am not sure what the best way is to present the output of the *_entities() functions.
  3. Dependency parsing output. The way I construct the data frame might be inefficient; please have a look.
  4. Currently the dev_rpython branch passes all checks. You can merge it into the master branch.
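To illustrate the just_tokenization idea in point 1, the intended usage would be roughly as follows (the option name and the deferred behaviour are tentative and not yet implemented; this reuses the txts object from the sketch above):

```r
# tentative sketch of the lazy-processing idea; not implemented yet
sp_out <- parse_spacy(txts, just_tokenization = TRUE)  # fast: tokenize only

toks <- get_tokens(sp_out)            # available immediately
ents <- get_entities(sp_out)          # would run entity recognition on demand
deps <- get_dependency_data(sp_out)   # would run the parser on demand
```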
kbenoit commented 8 years ago

Brilliant! I look forward to testing it out.

I will probably change some function names, but we can discuss first. Thanks Aki!!

kbenoit commented 8 years ago
  1. I am not sure what the best way is to present the output of the *_entities() functions.

How about we start all functions with:

This way, the package can extend the quanteda tokens() constructor, which is currently defined for corpus and character class inputs. It would also be consistent with the new quanteda API, since the tokens_* functions take a tokens object as input and return a tokens object. Here we would have extended the constructor and then defined additional methods to return an augmented tokens object.

In the current GitHub version of quanteda, there are functions such as as.list.tokens() that return the sort of list-of-characters tokens provided by your get_tokens().
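For reference, the existing quanteda pattern that this would extend looks like this (this is quanteda's own API, not spacyr):

```r
library(quanteda)

# tokens() constructs a tokens object from character or corpus input;
# as.list() (i.e. as.list.tokens) returns the plain list-of-characters form
toks <- tokens(c(d1 = "spaCy is a natural language processing library."))
as.list(toks)
## $d1
## [1] "spaCy" "is" "a" "natural" "language" "processing" "library" "."
```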

I am assuming that the pointer to the parsed spaCy object is not permanent, so the functions would need to check the state of the pointer and, if it is not current, refuse the command.
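For example, every extractor could begin with a guard along these lines (check_spacy_out() is a hypothetical helper, just to illustrate the idea):

```r
# hypothetical guard, illustrating the pointer check described above
check_spacy_out <- function(x) {
  if (!inherits(x, "spacy_out")) {
    stop("argument must be a spacy_out object")
  }
  # here we would ask the Python side whether the parsed documents
  # referenced by x are still available, and refuse the command if not
  invisible(x)
}
```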

  1. Dependency parsing output. The way I construct the data frame might be inefficient; please have a look.

Yes, but it's perfect (except for indexing from zero - not R-like!). On efficiency, I can fix that in the more efficient, indexed (hashed) version of tokens we re-designed for quanteda. We can also write an extractor function to produce your data.frame, if that is what people want. How we store it and what we extract don't have to be the same.
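(If we do keep the data.frame extractor, switching to R's one-based indexing is trivial; the column names here follow the illustrative sketch earlier in the thread:)

```r
# shift spaCy's zero-based token indices to R's one-based convention
deps$token_id <- deps$token_id + 1L
deps$head_id  <- deps$head_id  + 1L
```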

  1. Currently the dev_rpython branch passes all checks. You can merge it into the master branch.

Let's prune some functions first, settle on the names after some discussion, and then we can merge.