quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Add docid_field #156

Closed: kbenoit closed this 5 years ago

kbenoit commented 5 years ago

Adds a docid_field argument to readtext(), which lets the document identifier be taken from a named field for .csv, .tsv, .xls(x), and .ods inputs.

There is no default value, as requested in #155, because it only makes sense for spreadsheet-like inputs and because text_id also has no default.

Note: The branch is misnamed!
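As a rough illustration of the behaviour described above, here is a minimal base-R sketch of taking document identifiers from a named column of a spreadsheet-like input. The helper `apply_docid_field()`, the `speech_id` column, and the positional fallback IDs are all assumptions made for this example; this is not the function in the PR, and readtext's actual fallback naming may differ.

```r
# Hypothetical helper, not the package's implementation: move the column named
# by docid_field into a doc_id column and drop the original column.
apply_docid_field <- function(df, docid_field = NULL) {
  # No default value: without docid_field, fall back to positional identifiers
  # (an assumption for this sketch).
  if (is.null(docid_field)) {
    df$doc_id <- paste0("doc", seq_len(nrow(df)))
    return(df)
  }
  if (!docid_field %in% names(df)) {
    stop("docid_field '", docid_field, "' not found in the input")
  }
  ids <- as.character(df[[docid_field]])
  if (anyDuplicated(ids)) {
    stop("values in '", docid_field, "' are not unique")
  }
  df[[docid_field]] <- NULL
  df$doc_id <- ids
  df
}

# A small spreadsheet-like input written to a temporary .csv file.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(speech_id = c("s1", "s2"),
             text = c("First text.", "Second text.")),
  path, row.names = FALSE
)
raw <- read.csv(path, stringsAsFactors = FALSE)

with_ids <- apply_docid_field(raw, docid_field = "speech_id")
without_ids <- apply_docid_field(raw)
```

The sketch stops on a missing or non-unique ID column rather than repairing it silently; whether the package should be that strict is a design choice the thread does not settle.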

amatsuo commented 5 years ago

Is there a reason not to add this option to get_json()? I think it would make sense to have it there as well.

kbenoit commented 5 years ago

Agreed, it does make sense. The other question: Should we automatically recognize the quanteda::corpus.data.frame() defaults? i.e. docid_field = "doc_id", text_field = "text"?

This only makes sense for multi-document inputs, so it would not be an active default for single-document inputs that contain no key-value pairs or column headers, but we could state that clearly in the documentation. (With a one-function-does-all approach, not every argument can make sense for every input.)

amatsuo commented 5 years ago

> Should we automatically recognize the quanteda::corpus.data.frame() defaults? i.e. docid_field = "doc_id", text_field = "text"?

I was thinking about it. I'd say it would be good to emit a message along the lines of "A doc_id field exists in the file. If you intend to use it as a document identifier, use the docid_field option." Auto-recognition might be confusing.

For JSON, I will implement it later today.
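The notice suggested above, as an alternative to auto-recognition, might look roughly like the following base-R sketch. The helper name `hint_unused_docid()` and the exact message wording are assumptions for illustration, not code from the package.

```r
# Hypothetical helper: emit a hint when a doc_id column is present but the
# caller did not set docid_field. It never modifies the data.
hint_unused_docid <- function(df, docid_field = NULL) {
  if (is.null(docid_field) && "doc_id" %in% names(df)) {
    message(
      "A doc_id field exists in the file. ",
      "To use it as the document identifier, set docid_field = \"doc_id\"."
    )
  }
  invisible(df)
}

df <- data.frame(doc_id = c("a", "b"), text = c("One.", "Two."),
                 stringsAsFactors = FALSE)

hint_unused_docid(df)                          # emits the hint
hint_unused_docid(df, docid_field = "doc_id")  # silent
```

Using message() rather than warning() keeps the hint informational, and callers who do not want it can wrap the call in suppressMessages().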

kbenoit commented 5 years ago

Sounds good, pls make both changes. See the function I used for setting docid_field in utils.R. Once the JSON has become a data.frame I think we can use the same function, at the end of get_json().