ropenscilabs / tif

Text Interchange Formats
https://docs.ropensci.org/tif

List as corpus? #8

Closed lmullen closed 6 years ago

lmullen commented 6 years ago

My memory is fuzzy so pardon me if I've got this completely wrong.

Should a named list, where the names are document IDs and each element is a document stored as a character vector of length 1, be a valid form of a corpus?
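To make the question concrete, here is a minimal sketch of the proposed structure next to a named character vector holding the same documents. The document IDs and texts are made up for illustration, and the variable names are hypothetical.

```r
# Hypothetical example data: the IDs and texts are invented for illustration.
corpus_list <- list(
  doc1 = "The quick brown fox.",
  doc2 = "Jumps over the lazy dog."
)

# The same corpus stored as a named character vector.
corpus_chr <- c(
  doc1 = "The quick brown fox.",
  doc2 = "Jumps over the lazy dog."
)

# When every list element is a character vector of length 1,
# base R can convert between the two forms in either direction.
list_as_chr <- unlist(corpus_list)
chr_as_list <- as.list(corpus_chr)
```

Because the conversion is a single base R call each way, the named list carries no information that the named character vector lacks.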

That's a format that tokenizers accepts, but perhaps that is idiosyncratic. If there is no compelling reason to include this then perhaps it should be left out for simplicity.

The main reason I think it might be useful is that it might be better suited to parallel processing, but that's just a hunch.

statsmaths commented 6 years ago

I think a named list is a reasonable choice to store a corpus object. I'm not sure why we settled on a (possibly named) list rather than a character vector as one of the two formats. At the same time, I like keeping the number of formats to a minimum. The original idea was to have just one format, but due to the large list/data-frame divide we opted for two. So, I could see going either way with this one.

Of course, the tif format doesn't stop you from accepting other corpus objects as input.

statsmaths commented 6 years ago

I was flipping the two formats in the comment above (it has been a while!).

In terms of a named list, I don't think that there is currently enough usage of that format to require that everyone support it in order to be tif compliant. It's a reasonable thing to do as a one-off of course, but I don't think we should make support for it mandatory.
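As a sketch of what such a one-off might look like, a package could coerce a named list into a corpus data frame before doing any other work. The helper name `list_to_corpus` is hypothetical and not part of tif. The `doc_id` and `text` column names reflect my understanding of the tif corpus data frame and should be checked against the specification.

```r
# Hypothetical helper, not part of the tif package.
# Assumes the corpus data frame uses columns `doc_id` and `text`.
list_to_corpus <- function(x) {
  # Require a list with unique, non-missing names to use as document IDs.
  stopifnot(is.list(x), !is.null(names(x)), !anyDuplicated(names(x)))

  # Each document must be a single string.
  ok <- vapply(
    x,
    function(d) is.character(d) && length(d) == 1L,
    logical(1)
  )
  if (!all(ok)) {
    stop("every element must be a character vector of length 1")
  }

  data.frame(
    doc_id = names(x),
    text = unlist(x, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Usage with made-up documents.
corpus_df <- list_to_corpus(list(doc1 = "First text.", doc2 = "Second text."))
```

Keeping the coercion at the input boundary lets a package accept the list form without the format itself having to mandate it.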

lmullen commented 6 years ago

Okay, that sounds reasonable to me.