quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Accept "." as dvsep delimiter #70

Closed. stefan-mueller closed this issue 7 years ago

stefan-mueller commented 7 years ago

I am aware that one should not include "." in filenames. However, I downloaded a large number of txt files with names such as xxxx.yyy.zzz.txt, where each part (xxxx, zzz, etc.) contains information that should become a docvar in the corpus. I tried the following code to create docvars from the filename, but simply using dvsep = "." does not create the docvars.

text_example <- readtext(file = "xxx.yyy.zzz.txt", docvarsfrom = "filenames", dvsep = ".")

Which regular expression do I need to insert so that the information in the file names is used as docvars? If we have a solution, I can amend the vignette and/or the manual, and describe this special case.

kbenoit commented 7 years ago

That's an interesting edge case.

If you load it without the docvarsfrom, you can always parse the docnames manually:

do.call(rbind, strsplit(docnames(text_example), "."))

or something like that

stefan-mueller commented 7 years ago

Thanks for the reply and solution. This works – but only if we use "[.]" instead of ".".

Working example:

text_example <- readtext(file = "var1_var2.var3.var4.txt") 
docvars_text <- do.call(rbind, strsplit(docnames(text_example), "[.]"))

corpus_example <- corpus(text_example)

docvars(corpus_example) <- docvars_text
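A minimal base-R sketch of why the bare "." fails, assuming only that strsplit() treats its split argument as a regular expression unless fixed = TRUE is set. The file name is the one from the example above; the variable names are just for illustration.

```r
# In a regular expression, "." matches any single character,
# so every character of the file name is consumed as a separator.
fname <- "var1_var2.var3.var4.txt"

as_regex   <- strsplit(fname, ".")[[1]]               # only empty strings remain
as_class   <- strsplit(fname, "[.]")[[1]]             # character class: literal dot
as_escaped <- strsplit(fname, "\\.")[[1]]             # escaped dot: literal dot
as_fixed   <- strsplit(fname, ".", fixed = TRUE)[[1]] # no regex interpretation at all

print(as_regex)
print(as_class)
```

The last three calls give the same pieces. Note that the extension "txt" is still one of them, because the document name includes it.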

adamobeng commented 7 years ago

I think do.call is not necessary, as long as you specify dvsep as a character class or an escaped character:

> readtext::readtext('/tmp/var1_var2.var3.var4.txt', docvarsfrom='filenames', dvsep='\\.')
                        text   docvar1 docvar2 docvar3
var1_var2.var3.var4.txt      var1_var2    var3    var4
> readtext::readtext('/tmp/var1_var2.var3.var4.txt', docvarsfrom='filenames', dvsep='[.]')
                        text   docvar1 docvar2 docvar3
var1_var2.var3.var4.txt      var1_var2    var3    var4
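
If the manual route is still preferred, for example to drop the file extension or to name the columns yourself, something like the following base-R sketch might do. The file names and the docvar1..docvarN column names are hypothetical, and it assumes every file name has the same number of dot-separated parts.

```r
# Hypothetical file names following the xxxx.yyy.zzz.txt pattern
files <- c("aaaa.bbb.ccc.txt", "dddd.eee.fff.txt")

# Drop the directory and the ".txt" extension so it does not become a docvar
stems <- tools::file_path_sans_ext(basename(files))

# Split on a literal dot; fixed = TRUE avoids regex interpretation
parts <- do.call(rbind, strsplit(stems, ".", fixed = TRUE))

# One row per file, one column per file name part
docvars_df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(docvars_df) <- paste0("docvar", seq_len(ncol(docvars_df)))

print(docvars_df)
```

The resulting data frame could then be assigned with docvars(), as in the working example earlier in the thread.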