These would include:
- [ ] dependency parsing, which for "I solved the problem with statistics." would probably yield something like:

```
1  I           _  PRP  PRP  _  2  SUB
2  solved      _  VBD  VBD  _  0  ROOT
3  the         _  DT   DT   _  4  NMOD
4  problem     _  NN   NN   _  2  OBJ
5  with        _  IN   IN   _  2  VMOD
6  statistics  _  NNS  NNS  _  5  PMOD
7  .           _  .    .    _  2  P
```

(the format is explained in more detail here: http://ilk.uvt.nl/conll/#dataformat)
I have made significant changes to the `dev_rpython` branch and coded several functions to implement the points above. In the new implementation, `initialize_spacy_rpython()` creates a new object of the spacyr class in Python, and all work is done within that object. Texts are handed to Python through `parse_spacy()` in R; this call conducts all parsing, including named entity recognition and dependency parsing. `parse_spacy()` returns an S3 object (class "spacy_out") that points to the locations of the parsed documents in Python. This object is later used to get tokens, tokenization with tagging, etc. All functions implemented so far, except for `parse_spacy()`, require a "spacy_out" object.
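A minimal usage sketch of that workflow, assuming the call sequence described above (passing the texts directly to `parse_spacy()` is an assumption about its signature):

```r
library(spacyr)

# start the spaCy backend via rPython (creates the Python-side spacyr object)
initialize_spacy_rpython()

# parse texts: tokenization, tagging, NER, and dependency parsing in one pass
txt <- c(d1 = "I solved the problem with statistics.")
out <- parse_spacy(txt)

class(out)  # "spacy_out" -- a pointer to the parsed documents in Python
```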
Named entity recognition

There are two functions at the moment: `all_entities()` returns a list of all named entities, and `get_entities()` returns the result of entity recognition for each token. Entries of the returned object look like `ORG_B`, where `ORG` is the type of named entity and `B` indicates the beginning of the entity (the other value is `I`, which indicates "inside the entity").
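A sketch of what the two accessors might return (the exact return shapes and the empty-string padding are assumptions; the `_B` suffix follows the scheme described above):

```r
out <- parse_spacy(c(d1 = "Apple is buying a startup in London."))

all_entities(out)   # e.g. list(d1 = c("Apple", "London"))
get_entities(out)   # e.g. per-token labels: "ORG_B" "" "" "" "" "" "GPE_B" ""
```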
Simple tokenization

`get_tokens()`
Tokenization with tagging

`tag_new()` works much like the original `tag()` function. Internally, `tag_new()` calls two functions, `get_tokens()` and `get_tags()`.
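For instance (the return shapes here are assumptions about the current implementation):

```r
# presumably a list of character vectors, one per document
toks <- get_tokens(out)

# tokens paired with their POS tags, combining get_tokens() and get_tags()
tagged <- tag_new(out)
```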
Dependency parsing

`get_dependency_data()` returns a data frame of dependency data. I could not find a spaCy token attribute corresponding to FEATS, so that field is omitted.
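Presumably something like the following, if the data frame mirrors the CoNLL columns above minus FEATS (the column names here are hypothetical):

```r
# using the "I solved the problem with statistics." parse from earlier
dep <- get_dependency_data(parse_spacy(txt))
head(dep)
# hypothetical layout, one row per token:
#   id  token   lemma  tag  head_id  dep_rel
#    0  I       i      PRP        1  nsubj
#    1  solved  solve  VBD        1  ROOT
#   ...
```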
Things to consider

`.entity()` in spaCy is called separately after tokenization. However, the current `parse_spacy()` is obviously overkill if the goal is only tokenization (and tagging). I will probably set up a `just_tokenization` option later. My plan is that if documents are processed with the `just_tokenization` flag, the additional functionalities (tagging, entity recognition, and dependency parsing) will be run only when the relevant functions are called.

- I am not sure what the best way is to show the results of the `**_entities()` functions' outputs.
- Dependency parsing output: how I construct a data frame might be inefficient; you can have a look.
- Currently the `dev_rpython` branch passes all checks. You can merge it to the master branch.

Brilliant! I look forward to testing it out.
I will probably change some function names, but we can discuss first. Thanks Aki!!
> I am not sure what the best way is to show the results of the `**_entities()` functions' outputs.
How about we set the functions up as follows:

- `spacy_initialize()`: this will just call the latest one, which seems fastest and best. We can get rid of the others.
- `spacy_parse()`: new name for `parse_spacy()`.
- `tokens.spacy_out()`: uses a `spacy_out` class object to create an extended quanteda `tokens` class object, one that also includes the pointer to the parsed spacyr object.
- `tokens_tag.spacy_out()`: adds POS tags to the special `tokens` class object.
- `tokens_named_entities()`: adds named entities to the special `tokens` class object.
- `tokens_dependencies()`: adds dependency data to the special `tokens` class object.

This way, the package can extend the quanteda `tokens()` constructor, which is currently defined for `corpus` and `character` class inputs, as sketched below. It would also be consistent with the new quanteda API, since the `tokens_*` functions take a `tokens` object as input and return a `tokens` object. Here we have extended the constructor, and then defined additional methods to return an augmented `tokens` object.
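A rough sketch of what the constructor extension might look like (a sketch only: it assumes `get_tokens()` returns a named list of character vectors, and the `spacy_tokens` class name is hypothetical):

```r
# build an extended quanteda tokens object from a spacy_out object
tokens.spacy_out <- function(x, ...) {
    toks <- quanteda::as.tokens(get_tokens(x))   # list of characters -> tokens
    attr(toks, "spacy_out") <- x                 # keep the pointer to the parsed docs
    class(toks) <- c("spacy_tokens", class(toks))
    toks
}
```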
In the current GitHub version of quanteda, there are functions such as `as.list.tokens()` that return the sort of list-of-characters tokens provided by your `get_tokens()`.
I am assuming that the pointer to the parsed spaCy object is not permanent, so the functions would need to check the state of the pointer and, if it is not current, refuse the command.
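That check could be a small guard inside each accessor, perhaps along these lines (`is_valid()` is a hypothetical helper for probing the Python-side state):

```r
check_spacy_out <- function(x) {
    stopifnot(inherits(x, "spacy_out"))
    if (!is_valid(x))  # hypothetical: does the Python-side object still exist?
        stop("parsed spaCy object no longer available; re-run spacy_parse()")
    invisible(x)
}
```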
> Dependency parsing output: how I construct a data frame might be inefficient; you can have a look.
Yes but it's perfect (except for indexing from zero - not R-like!). On efficiency, I can fix that in the more efficient, indexed (hashed) version of `tokens` we re-designed for quanteda. We can also write some extractor function to produce your data.frame, if that is what people want. How we store it and what we extract don't have to be the same.
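The indexing fix is trivial on the R side, e.g. (using the hypothetical `id` column from the earlier sketch):

```r
# shift spaCy's 0-based token indices to R's 1-based convention
# (head indices would need the same shift, ROOT handling aside)
dep$id <- dep$id + 1
```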
> Currently the `dev_rpython` branch passes all checks. You can merge it to the master branch.
Let's prune some functions first, settle on the names after some discussion, and then we can merge.
- [ ] Figure out what other options are available in spaCy and build in options to call them.