Closed by kbenoit 6 years ago
That's a good point, and it also conflicts with my current structure. It makes much more sense to put all of the primary keys all the way to the left. Given that we say the matrix must be a data frame and the token column must be called `token`, perhaps there is no need to specify which columns (numerically) the keys actually correspond to?
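To illustrate the idea of identifying key columns by name rather than by numeric position, here is a minimal sketch. The helper `is_valid_tokens()` and the sample data are hypothetical and not part of any package discussed in this thread; the only assumptions are the ones stated above (the object is a data frame, and it has a `doc_id` column and a `token` column).

```r
# Hypothetical helper (not from any package in this thread):
# validate a tokens table by column name, ignoring column position.
is_valid_tokens <- function(x) {
  is.data.frame(x) && all(c("doc_id", "token") %in% names(x))
}

# Illustrative data only.
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc2"),
  token  = c("Hello", "world", "Goodbye"),
  stringsAsFactors = FALSE
)

ok_full    <- is_valid_tokens(tokens)                            # both columns present
ok_partial <- is_valid_tokens(tokens[, "token", drop = FALSE])   # doc_id missing
```

With a name-based check like this, reordering the columns does not affect whether the object conforms.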
I'd agree with that. How about:
data.frame
doc_id
token
doc_id
, starting at 1:
sentence_id
token_id
paragraph_id
name_id
format and repeat within doc_id
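A rough sketch of one way such a table could be built, assuming that "starting at 1" and "repeat within doc_id" mean each counter restarts at 1 for every document. The data values are made up for illustration, and only base R is used.

```r
# Illustrative data only; column names follow the ones mentioned in the thread.
tokens <- data.frame(
  doc_id      = c("d1", "d1", "d1", "d1", "d2", "d2"),
  sentence_id = c(1L, 1L, 2L, 2L, 1L, 1L),
  token       = c("Hi", ".", "Bye", ".", "Ok", "."),
  stringsAsFactors = FALSE
)

# Assumption: token_id is a serial number that restarts at 1 in each document.
tokens$token_id <- ave(seq_len(nrow(tokens)), tokens$doc_id, FUN = seq_along)

# Keys on the left (doc_id first, then the optional *_id columns), token last.
tokens <- tokens[, c("doc_id", "sentence_id", "token_id", "token")]
```

If the counters were instead meant to restart within each sentence, the grouping passed to `ave()` would need to include `sentence_id` as well.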
I silently agreed with your proposed changes, and have even been implicitly assuming this was the standard, but never actually incorporated them into the package. Fixed now with e1acb02bb54cf9cf6e7acfcb4c9061ae64884426.
Follow-up: We’ve designed the readtext and spacyr packages to be agnostic as to the text analysis framework, so that any package can benefit from them. We’re happy to take input about format/compliance/anything else in case other package authors think we could do this even better.
I also have a set of spacyr and readtext functions in quanteda designed to extend those packages in a quanteda-specific fashion, and could illustrate the system of re-exports required to extend quanteda methods for the other package’s object classes, or vice versa by extending the other package’s generics using methods for quanteda objects.
Two issues, related:
1. Would we want to define optional values, such as integer `sentence_id` or even `token_id`, to denote the sentence serial number within document and token serial number within sentence? Any reason why these ought to be unique across the token set?
2. If we do define additional `_id` fields, would we want to loosen the definition to be that "doc_id" comes first, followed by optional `_id` variables, followed by `tokens`?

I ask because we want to define the parsed spacy object structure in spacyr to conform, but it currently looks like this (in development):