Open: hansthompson opened this issue 6 years ago
@hansthompson is your repo public, by the way? I am not involved in pdftools development, but I am curious. I wonder how you have been extracting images, and how it complements pdftools, rtika, and tesseract.
Also, I know @sokal1456 has been looking for such a tool (not for tables, but for figures nested as images within PDFs).
In 2014 I made an R project called pdfharvester that is on GitHub, but I haven't maintained it. My aim was to crowdsource the conversion of images of tables in PDFs into data frames. I would redo the whole thing now: it didn't work with RStudio at the time, it used tcltk, and it is currently broken.
Part of solving that problem is that images within PDFs are important elements that need to be extracted, so I wrote a wrapper function around the command line tool pdfimages to do that, although it was one of my first larger projects and is pretty clunky.
It looks like pdftools::pdf_convert converts the whole page, with all of its elements, to an image file, which is different from extracting the image element itself from the PDF.
I hadn't seen or used rtika before, but it looks like it offers the kind of image extraction I'm looking for; still, a non-Java version might be better, imo.
I was wrapping the tesseract command line tool for a project before the tesseract R package came out, in order to OCR the table cell images I had made. I don't see anything in the tesseract package that converts PDFs to images, though; that's why I used pdfimages. (A rough sketch of the render-then-OCR route is below.)
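For reference, here is a minimal sketch of that render-then-OCR route, using pdftools::pdf_convert to rasterize each page and tesseract to OCR the result (the file name and dpi are placeholders; note this renders whole pages rather than pulling out the embedded image objects):

library(pdftools)
library(tesseract)

# render every page of the PDF to a PNG file (one file per page)
page_files <- pdf_convert("example.pdf", format = "png", dpi = 300)

# OCR each rendered page; returns a character vector of recognized text
page_text <- ocr(page_files, engine = tesseract("eng"))
cat(page_text[1])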
Hi,
My name is Dinda. I am a researcher, and I regularly need to go through a lot of PDF files. I am looking for code that will extract images, schemes, and tables from PDF files. Please help me if you already have any such code, either in R or Python. Thank you.
Regards,
Dinda
Here's a painful, mostly-R hack at parsing PDFs for images. It only works some of the time, but it would be great if someone could improve it.
library(stringr)
library(hexView)

# helper for formatting and raising error/warning messages
.PROBLEM <- function(type, aMessage) {
  newMessage <- paste0("problem ",
                       type,
                       " in ",
                       as.list(sys.call(-1))[[1]],
                       "(): ",
                       aMessage,
                       ".")
  if(type == "error") stop(newMessage, call. = FALSE)
  message(newMessage)
}
isPDF <- function(aFileName,
                  verbose = TRUE) {

  fileContents <- suppressWarnings(try(
    readLines(aFileName, n = 5, ok = TRUE, warn = FALSE),
    silent = TRUE))

  # error catch when downloading PDFs online
  if(inherits(fileContents, "try-error")) {
    if(verbose == TRUE) {
      message(paste0("Failed validation, with error: ", fileContents[1]))
    } else {
      if(verbose == "quiet") return(FALSE)
      # when verbose = FALSE, secret message intended for PDF_downloader success
      message(" poor internet connection,", appendLF = FALSE)
    }
    return(FALSE)
  }

  return(any(!is.na(str_extract(fileContents, ".*%PDF-1.*"))))
}
#' Attempts to extract all images from a PDF
#'
#' Tries to extract the images within a PDF file. Currently does not support
#' decoding of images in CCITT compression formats; however, it will still save
#' these images to file as a record of the number of images detected in the PDF.
#'
#' @param file The file name and location of a PDF file. Prompts
#'    for a file name if none is explicitly provided.
#'
#' @return A vector of the file names of the saved images.
#'
#' @export PDF_extractImages
PDF_extractImages <- function(file = file.choose()) {

  # check that the file is a PDF
  if(isPDF(file, verbose = "quiet") != TRUE) {
    .PROBLEM("error",
             "file not PDF format")
  }

  # read the file in hex and ASCII
  rawFile <- hexView::readRaw(file, human = "char")

  # test that the number of characters read matches the file size
  if (length(hexView::blockValue(rawFile)) != file.info(file)$size) {
    .PROBLEM("error",
             "possible size reading error of PDF")
  }

  # extract images embedded as PDF objects
  createdFiles_bin <- scanPDFobjects(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_bin), " ")

  # extract images embedded in XML
  createdFiles_XML <- scanPDFXML(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_XML), " ")

  theSavedFileNames <- c(createdFiles_bin, createdFiles_XML)

  #if(ignore != TRUE) {
  #  par(mfrow = c(2, 3), las = 1)
  #  for(i in 1:6) {
  #    figure_display(theSavedFileNames[i])
  #    mtext(theSavedFileNames[i], col = "red", cex = 1.2)
  #  }
  #}

  return(theSavedFileNames)
}
scanPDFobjects <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split the string by PDF objects and keep the delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "endobj")), "endobj", sep = "")

  # identify and screen candidate objects containing images
  candidateObjects <- c(which(str_extract(theObjects, "XObject/Width") == "XObject/Width"),
                        which(str_extract(theObjects, "Image") == "Image"))
  removeObjects <- c(which(str_extract(theObjects, "PDF/Text") == "PDF/Text"),
                     which(str_extract(theObjects, "PDF /Text") == "PDF /Text"))
  candidateObjects <- unique(candidateObjects[! candidateObjects %in% removeObjects])
  if(length(candidateObjects) == 0) {
    return("No PDF image objects detected.")
  }

  # generate file names for the candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_bin_", 1:length(candidateObjects), ".jpg", sep = "")

  # extract and save all image binaries found in the PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFobjectToImageFile(y[x],
                                                               theObjects,
                                                               file,
                                                               z[x]),
                        y = candidateObjects, z = fileNames)
  return(theNewFiles)
}
PDFobjectToImageFile <- function (objectLocation,
                                  theObjects,
                                  theFile,
                                  imageFileName) {

  # parse the object by stream & endstream
  parsedImageObject <- unlist(strsplit(theObjects[objectLocation], "stream"))

  # extract key char locations of the image in the PDF, with trailingChars as a
  # correction for "stream" being followed by 2 return characters
  trailingChars <- "  "
  startImageLocation <- nchar(paste(parsedImageObject[1],
                                    "stream", trailingChars, sep = ""))
  endImageLocation <- startImageLocation +
    nchar(substr(parsedImageObject[2],
                 1,
                 nchar(parsedImageObject[2]) - nchar("end")))
  PDFLocation <- nchar(paste(theObjects[1:(objectLocation - 1)], collapse = ''))

  # extract the image binary from the PDF
  PDFImageBlock <- hexView::readRaw(theFile,
                                    offset = PDFLocation + startImageLocation,
                                    nbytes = endImageLocation, machine = "binary")

  # sometimes part of the original file header is missing; this restores the
  # leading 0xff byte, for JPEGs at least
  if((PDFImageBlock$fileRaw[1] == "d8") && (PDFImageBlock$fileRaw[2] == "ff"))
    PDFImageBlock$fileRaw <- c(as.raw('0xff'), PDFImageBlock$fileRaw)

  # save the image binary to a new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(PDFImageBlock$fileRaw, detectedImageFile)
  close(detectedImageFile)

  # TO DO: return info about successful file save
  return(imageFileName)
}
scanPDFXML <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split by the XML tags holding images and keep the delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "xmpGImg:image>")),
                      "xmpGImg:image>", sep = "")

  # identify objects with images
  candidateObjects <- which(str_extract(theObjects, "</xmpGImg:image>") == "</xmpGImg:image>")
  if(length(candidateObjects) == 0) {
    return("No XML image objects detected.")
  }

  # generate file names for the candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_XML_", 1:length(candidateObjects), ".jpg", sep = "")

  # extract and save all image binaries found in the PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFXMLToImageFile(y[x],
                                                            theObjects,
                                                            file,
                                                            z[x]),
                        y = candidateObjects, z = fileNames)
  return(theNewFiles)
}
PDFXMLToImageFile <- function (objectLocation,
                               theObjects,
                               theFile,
                               imageFileName) {

  # parse the encoded XML image and strip the "&#xA;" line-break entities
  parsedImage <- unlist(strsplit(theObjects[objectLocation], "</xmpGImg:image>"))
  parsedImage <- gsub("&#xA;", "", parsedImage[1])

  # decode the base64-encoded image
  decodedImage <- RCurl::base64Decode(parsedImage, "raw")

  # save the image binary to a new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(decodedImage, detectedImageFile)
  close(detectedImageFile)

  # TO DO: return info about successful file save
  return(imageFileName)
}
PDF_extractImages()
Just saw that @sckott started his package https://github.com/sckott/pdfimager.
If anyone wants to contribute over there, please do, or do your own thing :)
Thank you all. I have not tried it yet, but pdfimager looks like what I was doing (but without my kludge). If this means closing this issue, I would be fine with that.
I would like to move my image-extraction work from calling pdfimages on the command line into your package. Would you be interested in including that command as an R function, something like the sketch below?
I feel this would open up more opportunities for analyzing the images that sit inside PDFs. My goal is better analysis of tables that are scanned copies embedded in PDFs.
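For what it's worth, here is a minimal sketch of what such a wrapper might look like, assuming poppler's pdfimages tool is installed and on the PATH; the function name, arguments, and output location here are just illustrative, not an existing API:

# hypothetical wrapper: extract embedded images from a PDF by shelling out
# to poppler's pdfimages, returning the paths of the files it produced
extract_pdf_images <- function(pdf, out_dir = tempfile("pdfimages_")) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  prefix <- file.path(out_dir, tools::file_path_sans_ext(basename(pdf)))
  # "-all" asks pdfimages to write each image in its native format;
  # paths containing spaces would need quoting before being passed along
  status <- system2("pdfimages", args = c("-all", pdf, prefix))
  if (status != 0) stop("pdfimages failed with exit status ", status)
  list.files(out_dir, full.names = TRUE)
}

# example usage (file name is a placeholder)
# extract_pdf_images("example.pdf", out_dir = "extracted_images")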