Open kbenoit opened 7 years ago
Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).
# TIKA Script
# Andreas Niekler <aniekler [at] informatik.uni-leipzig.de>
# Gregor Wiedemann <gregor.wiedemann [at] uni-leipzig.de>
# ===========
# Define function to extract text with Tika
# Tika Java Archive has to be copied to working directory, current version: tika-app-1.15.jar
tikaExtractTextFromFile <- function(file, sourceFolder, targetFolder) {
  # quote the input path so that file names containing spaces survive the shell
  inputPath <- shQuote(paste0(sourceFolder, file))
  command <- paste("java -jar tika-app-1.15.jar --text", inputPath)
  output <- system(command, intern = TRUE) # execute Tika via shell
  output <- iconv(output, to = "UTF-8")
  fileConn <- file(paste0(targetFolder, file, ".txt"), encoding = "UTF-8")
  writeLines(output, fileConn)
  close(fileConn)
}
# define input folder
sourceFolder <- "./data_X/"
myFiles <- list.files(path = sourceFolder, pattern = NULL, # use pattern argument for specific file types, e.g. PDF: "pdf$"
                      full.names = FALSE, recursive = FALSE,
                      include.dirs = FALSE)
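# define output folder (created if it does not exist yet)
targetFolder <- "./data_X_txt/"
dir.create(targetFolder, showWarnings = FALSE, recursive = TRUE)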
# iterate over files in input folder, extract text and save to output folder
for (filename in myFiles) {
  cat("Extracting from", filename, "...\n")
  tikaExtractTextFromFile(filename, sourceFolder = sourceFolder, targetFolder = targetFolder)
}
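As written, the loop above stops at the first file that raises an error. A minimal sketch of one way to keep going and collect the failures is shown below. `extractAll` and its `extractFun` argument are hypothetical names that are not part of the original script. The sketch assumes that any R error or warning raised while processing a file should count as a failure; `system(..., intern = TRUE)` typically signals a non-zero exit status as a warning.

```r
# Hypothetical helper (not part of the original script): apply an extraction
# function to each file name and return the names of the files that failed.
extractAll <- function(files, extractFun) {
  failed <- character(0)
  for (f in files) {
    ok <- tryCatch({
      extractFun(f)
      TRUE
    }, error = function(e) {
      # an R error was raised while processing this file
      message("Failed: ", f, " (", conditionMessage(e), ")")
      FALSE
    }, warning = function(w) {
      # assumption: a warning also marks the file as failed
      message("Warning for ", f, ": ", conditionMessage(w))
      FALSE
    })
    if (!ok) failed <- c(failed, f)
  }
  failed
}

# Possible usage with the function and folders defined in the script above:
# failed <- extractAll(myFiles, function(f) {
#   tikaExtractTextFromFile(f, sourceFolder = sourceFolder, targetFolder = targetFolder)
# })
```

The returned character vector can then be inspected or passed to a second run once the problematic files have been checked by hand.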
From quanteda issue #380:
Thanks @BobMuenchen.