Closed lmullen closed 6 years ago
I think a named list is a reasonable choice to store a corpus object. I'm not sure why we settled on a (possibly named) list rather than a character vector as one of the two formats. At the same time, I like keeping the number of formats to a minimum. The original idea was to have just one format, but due to the large list/data-frame divide we opted for two. So, I could see going either way with this one.
Of course, the tif format doesn't stop you from accepting other corpus objects as input.
I was flipping the two formats in the comment above (it has been a while!).
In terms of a named list, I don't think that there is currently enough usage of that format to require that everyone support it to be tif compliant. It's a reasonable thing to do as a one-off, of course, but I don't think we should make support of it mandatory.
Okay, that sounds reasonable to me.
My memory is fuzzy so pardon me if I've got this completely wrong.
Should a named list, where the names are document IDs and each element is a document stored as a character vector of length 1, be a valid form of a corpus?
That's a format that tokenizers accepts, but perhaps that is idiosyncratic. If there is no compelling reason to include this then perhaps it should be left out for simplicity.
The main reason I think it might be useful is that it might be better suited to parallel processing, but that's just a hunch.
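For concreteness, here is a rough sketch in base R of the named-list shape being asked about, next to the named character vector and data frame forms. The document IDs, texts, and the helper `is_named_list_corpus()` are made up for illustration and are not part of tif or tokenizers; the `doc_id` and `text` column names are my understanding of the data frame form and should be checked against the spec.

```r
# Hypothetical two-document corpus; the IDs and texts are invented.
corpus_list <- list(
  doc1 = "The quick brown fox.",
  doc2 = "Jumps over the lazy dog."
)

# Illustrative check (not a tif function): a plain list whose names are
# non-empty document IDs and whose elements are length-1 character vectors.
is_named_list_corpus <- function(x) {
  is.list(x) &&
    !is.data.frame(x) &&
    !is.null(names(x)) &&
    all(nzchar(names(x))) &&
    all(vapply(x, function(d) is.character(d) && length(d) == 1L, logical(1)))
}

# Named character vector: unlist() keeps the names because every
# element has length 1.
corpus_chr <- unlist(corpus_list)

# Data frame form, assuming columns called doc_id and text.
corpus_df <- data.frame(
  doc_id = names(corpus_list),
  text = unlist(corpus_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Each document is its own list element, so lapply-style functions
# map over the corpus one document at a time.
n_chars <- lapply(corpus_list, nchar)
```

Since the conversion in either direction is a one-liner, a package could accept the named list as a convenience without it being one of the required formats.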