ropenscilabs / tif

Text Interchange Formats
https://docs.ropensci.org/tif

Is column order in a corpus data frame important? #7

Closed lmullen closed 6 years ago

lmullen commented 6 years ago

The recommendation for a corpus data frame is currently to have column 1 be the doc_id and column 2 be the text, with additional metadata being optional. Is it really necessary to enforce the column order? It seems more important to have columns with the proper names, with no need to specify their order. For example, one could imagine loading a CSV of metadata and then adding a column of texts; forcing a reorder afterwards would bring no real gain.

If this restriction is removed, the checking functions just need to remove that check, and the coercion functions have to rely on column names instead of column indexes.
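
To make the proposal concrete, here is a minimal sketch of a check that relies on column names only. The helper `is_corpus_df_by_name` and the sample data are hypothetical and are not the package's actual validator; the sketch also ignores any other requirements the specification may place on a corpus data frame.

```r
# Hypothetical helper (not part of tif): accept a corpus data frame when
# the required columns are present by name, wherever they sit.
is_corpus_df_by_name <- function(x) {
  is.data.frame(x) && all(c("doc_id", "text") %in% names(x))
}

# Metadata first (as if loaded from a CSV), texts added afterwards,
# so doc_id is column 2 and text is column 3.
meta <- data.frame(
  author = c("Austen", "Melville"),
  doc_id = c("doc1", "doc2"),
  stringsAsFactors = FALSE
)
meta$text <- c("It is a truth universally acknowledged", "Call me Ishmael")

ok <- is_corpus_df_by_name(meta)               # passes despite the column order
missing_text <- is_corpus_df_by_name(meta[, c("author", "doc_id")])  # fails: no text column
```

A position-based check would reject `meta` here, even though every required column is present.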

statsmaths commented 6 years ago

We changed the forced ordering of columns for the tokens object, so I think this would be reasonable and consistent. Can anyone think of a reason not to do this?

kbenoit commented 6 years ago

I don't think we need to enforce column ordering, just names. Indexing a matrix-like object by numeric values alone is always worse than by name if the name is known and available.
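
A small illustration of the point about indexing, using made-up data: the same two columns in a different order give different results by position but the same result by name.

```r
# Two data frames with identical content but different column order.
corpus_a <- data.frame(
  doc_id = c("doc1", "doc2"),
  text   = c("first text", "second text"),
  stringsAsFactors = FALSE
)
corpus_b <- corpus_a[, c("text", "doc_id")]

# Positional indexing silently returns different columns.
by_pos_a <- corpus_a[[2]]   # the text column
by_pos_b <- corpus_b[[2]]   # the doc_id column

# Name-based indexing returns the same column either way.
by_name_a <- corpus_a[["text"]]
by_name_b <- corpus_b[["text"]]
same_by_name <- identical(by_name_a, by_name_b)
```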

statsmaths commented 6 years ago

Okay, I'll take that as a consensus. Fixed now with bd1182c6c1ddcd2eaa26b44224ed6a150e9596a7.