quanteda / readtext

an R package for reading text files
https://readtext.quanteda.io

Add support for Apache Tika types #43

Open kbenoit opened 7 years ago

kbenoit commented 7 years ago

From quanteda issue #380:

Apache Tika (https://tika.apache.org/) might be useful. The KNIME folks just added that to their text mining nodes.

Thanks @BobMuenchen.

kbenoit commented 7 years ago

Added to the issue: a script contributed by Arthur Stenzel (thanks Arthur!).

# TIKA Script
# Andreas Niekler <aniekler [at] informatik.uni-leipzig.de>
# Gregor Wiedemann <gregor.wiedemann [at] uni-leipzig.de>
# ===========

# Define function to extract text with Tika
# Tika Java Archive has to be copied to working directory, current version: tika-app-1.15.jar
tikaExtractTextFromFile <- function(file, sourceFolder, targetFolder) {

  # quote the input path so file names containing spaces survive the shell
  inputPath <- shQuote(paste0(sourceFolder, file))
  command <- paste("java -jar tika-app-1.15.jar --text", inputPath)
  output <- system(command, intern = TRUE) # execute Tika via shell
  output <- iconv(output, to = "UTF-8")
  # write the extracted text to <targetFolder><file>.txt
  fileConn <- file(paste0(targetFolder, file, ".txt"), encoding = "UTF-8")
  writeLines(output, fileConn)
  close(fileConn)

}

# define input folder
sourceFolder <- "./data_X/"
myFiles <- list.files(path = sourceFolder, pattern = NULL, # use pattern argument for specific file types, e.g. PDF: "pdf$"
                      full.names = FALSE, recursive = FALSE,
                      include.dirs = FALSE)
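# define output folder and create it if it does not exist yet
targetFolder <- "./data_X_txt/"
dir.create(targetFolder, showWarnings = FALSE, recursive = TRUE)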

# iterate over files in input folder, extract text and save to output folder
for (filename in myFiles) {
  cat("Extracting from", filename, "...\n")
  tikaExtractTextFromFile(filename, sourceFolder = sourceFolder, targetFolder = targetFolder)
}
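
The script above stops once the .txt files are written. As a rough sketch of the follow-up step, the extracted files could be gathered into a data frame with one row per document. The helper name readExtractedTexts is hypothetical and not part of readtext or of the contributed script, and the doc_id/text column names are only an assumption chosen to resemble the layout readtext returns.

```r
# Hypothetical helper (not part of readtext): collect the .txt files written
# by the Tika script into a data frame with one row per document.
readExtractedTexts <- function(folder) {
  # only pick up the extracted text files, ignoring anything else in the folder
  paths <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
  # read each file and join its lines back into a single string
  texts <- vapply(paths, function(p) {
    paste(readLines(p, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  # column names are an assumption, chosen to resemble readtext output
  data.frame(doc_id = basename(paths), text = texts, stringsAsFactors = FALSE)
}

# Usage, assuming the output folder from the script above
extracted <- readExtractedTexts("./data_X_txt/")
```

If readtext is installed, pointing readtext() at the same folder should serve the same purpose; the base R version is shown only so the sketch has no package dependencies.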