sentence_id for tokens data.frame?

kbenoit commented 7 years ago

Two issues, related:

1. Would we want to define optional values, such as an integer sentence_id or even token_id, to denote the sentence serial number within a document and the token serial number within a sentence? Is there any reason why these ought to be unique across the whole token set?
2. If we do define additional _id fields, would we want to loosen the definition so that "doc_id" comes first, followed by optional _id variables, followed by tokens?

I ask because we want to define the parsed spacy object structure in spacyr to conform, but it currently looks like this (in development):

    doc_id sentence_id token_id      token     lemma tag_detailed tag_google head_token_id  dep_rel named_entity
1    text1           1        1        Mr.       mr.          NNP      PROPN             3 compound             
2    text1           1        2    Winston   winston          NNP      PROPN             3 compound     PERSON_B
3    text1           1        3  Churchill churchill          NNP      PROPN             4    nsubj     PERSON_I
4    text1           1        4        ate       eat          VBD       VERB             4     ROOT             
5    text1           1        5        ten       ten           CD        NUM             6   nummod   CARDINAL_B
6    text1           1        6 sandwiches  sandwich          NNS       NOUN             4     dobj             
7    text1           1        7          .         .            .      PUNCT             4    punct             
8    text1           2        1          I    -PRON-          PRP       PRON             2    nsubj             
9    text1           2        2        had      have          VBD       VERB             2     ROOT             
10   text1           2        3        two       two           CD        NUM             2     dobj   CARDINAL_B
11   text1           2        4          .         .            .      PUNCT             2    punct
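
For illustration only, here is a minimal base-R sketch of how a per-document sentence_id and a per-sentence token_id like those above could be derived. The toy data and the `sentence_end` flag are assumptions made for this example; this is not how spacyr computes its output.

```r
# Toy token table; `sentence_end` is a hypothetical flag marking
# the last token of each sentence (not a spacyr or tif column).
tokens <- data.frame(
  doc_id       = c(rep("text1", 5), rep("text2", 3)),
  token        = c("I", "ate", ".", "Yum", ".", "Hello", "there", "."),
  sentence_end = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# sentence_id: sentence serial number within each document.
# A new sentence starts right after a sentence-final token, so we
# lag the flag by one position and take a running sum per doc_id.
tokens$sentence_id <- ave(
  as.integer(tokens$sentence_end), tokens$doc_id,
  FUN = function(x) cumsum(c(0L, head(x, -1L))) + 1L
)

# token_id: token serial number within each (doc_id, sentence_id) pair.
tokens$token_id <- ave(
  seq_len(nrow(tokens)), tokens$doc_id, tokens$sentence_id,
  FUN = seq_along
)

# Put the key columns on the left, followed by the token column.
tokens <- tokens[, c("doc_id", "sentence_id", "token_id", "token")]
print(tokens)
```

With this layout neither sentence_id nor token_id is unique across the whole token set; a row is identified by the combination of doc_id, sentence_id and token_id.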

statsmaths commented 7 years ago

That's a good point, and it also conflicts with my current structure. It makes much more sense to put all of the primary keys all the way to the left... Given that we say the object must be a data frame and the token column must be called token, perhaps there is no need to specify which column positions (numerically) the keys actually correspond to?

kbenoit commented 7 years ago

I'd agree with that. How about:

- the object class must be (or derive from) a data.frame
- columns must include
  - doc_id
  - token
- columns may include additional integer identifiers that will be unique within doc_id, starting at 1:
  - sentence_id
  - token_id
  - paragraph_id
  - any additional user-supplied ID variables should conform to the name_id format and repeat within doc_id
- additional token-level metadata columns are allowed but not required
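
As a rough sketch of what checking these rules could look like in base R: `check_tokens_df()` below is a hypothetical helper written for this comment, not a function from the tif package. Treating every column whose name ends in `_id` (other than doc_id) as an integer identifier is an assumption, and the uniqueness rule is deliberately left unchecked.

```r
# Hypothetical validator for the proposed tokens data.frame rules.
# Returns a character vector of problems; empty means no problem found.
check_tokens_df <- function(x) {
  # Rule: the object must be (or derive from) a data.frame.
  if (!inherits(x, "data.frame")) {
    return("object must be (or derive from) a data.frame")
  }
  problems <- character(0)

  # Rule: doc_id and token are required columns (position is not checked).
  missing_cols <- setdiff(c("doc_id", "token"), names(x))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing required column:", missing_cols))
  }

  # Assumption: any other column named *_id is an optional identifier
  # that must be an integer starting at 1.
  id_cols <- setdiff(grep("_id$", names(x), value = TRUE), "doc_id")
  for (col in id_cols) {
    v <- x[[col]]
    if (!is.integer(v)) {
      problems <- c(problems, paste(col, "must be integer"))
    } else if (anyNA(v) || any(v < 1L)) {
      problems <- c(problems, paste(col, "must be >= 1 with no missing values"))
    }
  }
  problems
}

# A frame that follows the rules, and one that breaks two of them
# (no token column, non-integer sentence_id).
ok  <- data.frame(doc_id = "text1", sentence_id = 1L, token_id = 1:2,
                  token = c("Hello", "."), stringsAsFactors = FALSE)
bad <- data.frame(doc_id = "text1", sentence_id = 1.5, word = "Hello",
                  stringsAsFactors = FALSE)

check_tokens_df(ok)
check_tokens_df(bad)
```

Any additional token-level metadata columns pass through unchecked, which matches the "allowed but not required" rule.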

statsmaths commented 6 years ago

I silently agreed with your proposed changes, and have even been implicitly assuming this was the standard, but never actually incorporated them into the package. Fixed now with e1acb02bb54cf9cf6e7acfcb4c9061ae64884426.

kbenoit commented 6 years ago

Follow-up: We’ve designed the readtext and spacyr packages to be agnostic as to the text analysis framework, so that any package can benefit from them. We’re happy to take input about format/compliance/anything else in case other package authors think we could do this even better.

I also have a set of spacyr and readtext functions in quanteda designed to extend those packages in a quanteda-specific fashion, and could illustrate the system of re-exports required to extend quanteda methods for the other package’s object classes, or vice versa by extending the other package’s generics using methods for quanteda objects.

ropenscilabs/tif #4