Open kbenoit opened 6 years ago
This should parse out the filepaths, not filepaths and filenames.
> (rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"),
+                  docvarsfrom = "filepaths", docvarnames = "sentiment"))
readtext object consisting of 10 documents and 4 docvars.
# data.frame [10 × 6]
  doc_id       text         sentiment                                    docvar2    docvar3 docvar4
  <chr>        <chr>        <chr>                                        <chr>      <chr>   <chr>
1 neg_cv000_2… "\"plot : t… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv000   29416.…
2 neg_cv001_1… "\"the happ… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv001   19502.…
3 neg_cv002_1… "\"it is mo… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv002   17424.…
4 neg_cv003_1… "\" \" ques… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv003   12683.…
5 neg_cv004_1… "\"synopsis… /Library/Frameworks/R.framework/Versions/3.… reviews/n… cv004   12641.…
6 pos_cv000_2… "\"films ad… /Library/Frameworks/R.framework/Versions/3.… reviews/p… cv000   29590.…
# ... with 4 more rows
Warning message:
In get_docvars_filenames(files, dvsep, docvarnames, docvarsfrom ==  :
  Fewer docnames supplied than existing docvars - last 3 docvars given generic names.
The idea behind docvarsfrom = "filepaths" is not to parse the filenames, but rather to take as docvars the folder parts of the paths matched by the supplied file pattern.
So in the example:
DATA_DIR <- system.file("extdata/", package = "readtext")
# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))
it should return:
readtext object consisting of 10 documents and 1 docvar.
# data.frame [10 × 3]
  doc_id              text                sentiment
  <chr>               <chr>               <chr>
1 neg_cv000_29416.txt "\"plot : two\"..." neg
2 neg_cv001_19502.txt "\"the happy \"..." neg
3 neg_cv002_17424.txt "\"it is movi\"..." neg
4 neg_cv003_12683.txt "\" \" quest f\"..." neg
5 neg_cv004_12641.txt "\"synopsis :\"..." neg
6 pos_cv000_29590.txt "\"films adap\"..." pos
# ... with 4 more rows
where the neg and pos labels come not from the filenames but from the path at the match level, e.g. the part before the / in:
> list.files(path = paste0(DATA_DIR, "txt/movie_reviews/"), recursive = TRUE)
 [1] "neg/neg_cv000_29416.txt" "neg/neg_cv001_19502.txt" "neg/neg_cv002_17424.txt"
 [4] "neg/neg_cv003_12683.txt" "neg/neg_cv004_12641.txt" "pos/pos_cv000_29590.txt"
 [7] "pos/pos_cv001_18431.txt" "pos/pos_cv002_15918.txt" "pos/pos_cv003_11664.txt"
[10] "pos/pos_cv004_11636.txt"
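To make the intended mapping concrete, here is a minimal base R sketch (not readtext's implementation) that takes two of the relative paths from the listing above and derives the document id from the filename and the docvar from the folder part only. The object names `rel_paths` and `expected` are illustrative.

```r
# Two of the relative paths from the list.files() output above
rel_paths <- c("neg/neg_cv000_29416.txt", "pos/pos_cv000_29590.txt")

# doc_id comes from the filename, the docvar from the folder part only;
# the filename itself is never split into further docvars
expected <- data.frame(
  doc_id = basename(rel_paths),
  sentiment = dirname(rel_paths),
  stringsAsFactors = FALSE
)
print(expected)
```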
When docvarsfrom = "filepaths", the filenames should not be parsed into docvars.
The root cause is that Sys.glob() does not tell us which part of each file path the "*" matched. https://github.com/quanteda/readtext/blob/555aa7222c255a0cde3e17e983dede0e240857f5/R/utils.R#L164
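One possible workaround, sketched here under assumptions rather than as a proposed patch: since Sys.glob() only returns the matched paths, the part matched by the wildcard can be approximated by stripping the fixed directory prefix of the pattern (everything up to the last "/" before the first wildcard) from each file path. The helper name `path_docvars` is hypothetical and not part of readtext, and the sketch assumes "/"-separated paths and a wildcard in the last pattern component only.

```r
# Hypothetical helper, not part of readtext: recover the folder part that
# lies below the fixed (wildcard-free) directory prefix of a glob pattern.
path_docvars <- function(pattern, files) {
  # drop everything from the first wildcard character onwards
  prefix <- sub("[*?\\[].*$", "", pattern, perl = TRUE)
  # keep only whole directory components of what remains
  prefix <- sub("[^/]*$", "", prefix)
  # path of each file relative to the fixed prefix
  rel <- substring(files, nchar(prefix) + 1)
  # folder part only; files directly at the match level yield "."
  dirname(rel)
}

# Throwaway directory tree mirroring the layout in this issue
root <- file.path(normalizePath(tempdir(), winslash = "/"), "movie_reviews")
for (label in c("neg", "pos")) {
  dir.create(file.path(root, label), recursive = TRUE, showWarnings = FALSE)
  writeLines("some text", file.path(root, label, paste0(label, "_cv000_1.txt")))
}

pattern <- file.path(root, "*")
matches <- Sys.glob(pattern)  # matched paths only, no record of what "*" matched
files <- list.files(matches, full.names = TRUE, recursive = TRUE)
folder_labels <- path_docvars(pattern, files)
print(folder_labels)
```

Patterns with wildcards in several components would need each component handled separately, which this sketch does not attempt.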