quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

docvarsfrom = "filepaths" not working as expected #141

Open kbenoit opened 6 years ago

kbenoit commented 6 years ago

Error

With docvarsfrom = "filepaths", only the file paths should be parsed into docvars, not the file paths and the filenames.

> (rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"), 
+                  docvarsfrom = "filepaths", docvarnames = "sentiment"))
readtext object consisting of 10 documents and 4 docvars.
# data.frame [10 × 6]
  doc_id       text         sentiment                                    docvar2    docvar3 docvar4
  <chr>        <chr>        <chr>                                        <chr>      <chr>   <chr>  
1 neg_cv000_2… "\"plot : t… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv000   29416.…
2 neg_cv001_1… "\"the happ… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv001   19502.…
3 neg_cv002_1… "\"it is mo… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv002   17424.…
4 neg_cv003_1… "\" \" ques… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv003   12683.…
5 neg_cv004_1… "\"synopsis… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv004   12641.…
6 pos_cv000_2… "\"films ad… /Library/Frameworks/R.framework/Versions/3.… reviews/p… cv000   29590.…
# ... with 4 more rows
Warning message:
In get_docvars_filenames(files, dvsep, docvarnames, docvarsfrom ==  :
  Fewer docnames supplied than existing docvars - last 3 docvars given generic names.

Expected behaviour

The idea behind docvarsfrom = "filepaths" is not to parse the filenames, but to take as docvars the folder parts of the paths matched by the supplied file pattern.

So in the example:

library("readtext")
DATA_DIR <- system.file("extdata/", package = "readtext")
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"), 
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))

it should return:

readtext object consisting of 10 documents and 1 docvar.
# data.frame [10 × 3]
  doc_id              text                 sentiment
  <chr>               <chr>                <chr>    
1 neg_cv000_29416.txt "\"plot : two\"..."  neg      
2 neg_cv001_19502.txt "\"the happy \"..."  neg      
3 neg_cv002_17424.txt "\"it is movi\"..."  neg      
4 neg_cv003_12683.txt "\" \" quest f\"..." neg      
5 neg_cv004_12641.txt "\"synopsis :\"..."  neg      
6 pos_cv000_29590.txt "\"films adap\"..."  pos      
# ... with 4 more rows

where the neg and pos labels come not from the filenames but from the path at the level of the match, e.g. the part before the / in:

> list.files(path = paste0(DATA_DIR, "txt/movie_reviews/"), recursive = TRUE)
 [1] "neg/neg_cv000_29416.txt" "neg/neg_cv001_19502.txt" "neg/neg_cv002_17424.txt"
 [4] "neg/neg_cv003_12683.txt" "neg/neg_cv004_12641.txt" "pos/pos_cv000_29590.txt"
 [7] "pos/pos_cv001_18431.txt" "pos/pos_cv002_15918.txt" "pos/pos_cv003_11664.txt"
[10] "pos/pos_cv004_11636.txt"

When docvarsfrom = "filepaths", the filenames should not be parsed into docvars.
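
As a rough illustration of the expected result, the folder labels can be read off relative paths like the ones listed above with base R's dirname(). This is only a sketch of the intended output, not how readtext is implemented; the rel_paths vector is copied by hand from the listing.

```r
# Two of the relative paths from the list.files() output above
rel_paths <- c("neg/neg_cv000_29416.txt", "pos/pos_cv000_29590.txt")

# The folder part becomes the docvar; the filename is left unparsed
sentiment <- dirname(rel_paths)

data.frame(doc_id = basename(rel_paths), sentiment = sentiment)
```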

koheiw commented 4 years ago

The root cause is that Sys.glob() returns only the matched paths and does not tell us which part of each file path the "*" matched. https://github.com/quanteda/readtext/blob/555aa7222c255a0cde3e17e983dede0e240857f5/R/utils.R#L164
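
One possible workaround, sketched below under some assumptions, is to recover the matched part from the pattern itself. Take everything before the first wildcard component as a fixed root, strip that root from each file path, and keep the remaining folder part. The helper name path_docvars is hypothetical and not part of readtext. The sketch assumes forward-slash separators, at least one "*" or "?" in the pattern, and file paths that all start with the fixed root.

```r
# Hypothetical helper (not part of readtext): return the folder part of each
# file path that lies at or below the first wildcard component of `pattern`.
path_docvars <- function(files, pattern) {
  parts <- strsplit(pattern, "/", fixed = TRUE)[[1]]
  # Index of the first path component containing a wildcard ("*" or "?" only)
  first_wild <- which(grepl("[*?]", parts))[1]
  stopifnot(!is.na(first_wild))
  # Fixed root: all components before the first wildcard
  root <- paste(parts[seq_len(first_wild - 1)], collapse = "/")
  # Drop the root and the separator that follows it, then keep the folders
  rel <- substring(files, nchar(root) + 2)
  dirname(rel)
}

# Self-contained demonstration on a temporary directory tree
root <- file.path(tempdir(), "movie_reviews_demo")
dir.create(file.path(root, "neg"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "pos"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, "neg", c("a.txt", "b.txt")))
file.create(file.path(root, "pos", "c.txt"))

files <- list.files(root, recursive = TRUE, full.names = TRUE)
path_docvars(files, file.path(root, "*"))
```

A file that sits directly at the wildcard level, with no intermediate folder, would get "." from dirname(), so a real implementation would need to decide how to treat that case.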