ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Feature Request: extract images from pdfs #23

Open hansthompson opened 6 years ago

hansthompson commented 6 years ago

I would like to move my work on extracting images from PDFs away from calling pdfimages directly and over to your package. Would you be interested in wrapping this command in an R function?

I feel this would open up more opportunities for analysis of images that are within PDFs. My goal is to get better analysis of tables that are scanned copies embedded in PDFs.
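
For anyone reading along, here is a minimal sketch of what such a wrapper could look like. It assumes the poppler `pdfimages` binary is installed and on the PATH. The names `pdfimages_args` and `extract_images_cli` are hypothetical and are not part of pdftools; `-all` asks pdfimages to write each image in its native format where it can.

```r
# Hypothetical helper: build the argument vector for the poppler `pdfimages` CLI.
# Paths are shell-quoted so that file names containing spaces survive system2().
pdfimages_args <- function(pdf, prefix, native = TRUE) {
  c(if (native) "-all", shQuote(pdf), shQuote(prefix))
}

# Hypothetical wrapper: run pdfimages and return the files found in `outdir`.
extract_images_cli <- function(pdf, outdir = tempfile("pdfimg_")) {
  if (!nzchar(Sys.which("pdfimages"))) {
    stop("pdfimages (poppler-utils) was not found on the PATH", call. = FALSE)
  }
  dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
  prefix <- file.path(outdir, "img")
  status <- system2("pdfimages", pdfimages_args(pdf, prefix))
  if (status != 0) {
    stop("pdfimages exited with status ", status, call. = FALSE)
  }
  list.files(outdir, full.names = TRUE)
}
```

Calling `extract_images_cli("scan.pdf")` (with a placeholder file name) would then return the paths of the extracted images, ready to pass on to other packages.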

maelle commented 6 years ago

@hansthompson is your repo public by the way? I am not involved in pdftools development but am curious. I wonder how you've been extracting images, and how it complements pdftools, rtika and tesseract.

Besides, I know @sokal1456 had been looking for such a tool (not for tables, for figures, but nested as images within PDFs).

hansthompson commented 6 years ago

In 2014 I made an R project called pdfharvester, which is on GitHub. My aim was to crowdsource the conversion of table images in PDFs into data frames. I have not maintained it, and I would redo the whole thing now: it didn't work with RStudio at the time, it used tcltk, and it is currently broken.

Still, one lesson from that work is that images within PDFs are important elements that need to be extracted, so I wrote a wrapper function around the command-line tool pdfimages to do that. It was one of my first larger projects and is pretty clunky.

It looks like pdftools::pdf_convert renders the whole page, with all of its elements, to an image file, which is different from extracting an individual image element from the PDF.
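
To make the difference concrete, here is a minimal sketch of the page-rendering route. It assumes pdftools is installed; `render_pages` is a hypothetical helper name. The result is one PNG per page, containing text, vector graphics, and images together, rather than one file per embedded image.

```r
# Hypothetical helper: rasterise whole pages (not embedded images) to PNG files.
# By default pdf_convert() writes the files into the current working directory.
render_pages <- function(pdf, pages = NULL, dpi = 150) {
  stopifnot(file.exists(pdf))
  pdftools::pdf_convert(pdf, format = "png", pages = pages,
                        dpi = dpi, verbose = FALSE)
}
```

A scanned table that fills half a page would therefore still need to be cropped out of the rendered page, which is the step that pdfimages avoids.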

I hadn't seen or used rtika before. It looks like it offers image extraction the way I'm looking for, but it may be better to have a non-Java version, imo.

Before the tesseract R package came out, I was wrapping the tesseract command-line tool for that project to OCR the table-cell images I made. I don't see anything in the tesseract package that does the conversion to images, though. That's why I used pdfimages.
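
For the OCR step itself, the tesseract R package can take image files directly once they have been extracted. A minimal sketch, assuming the package and its English training data are installed; `ocr_images` is a hypothetical helper name.

```r
# Hypothetical helper: OCR a vector of already-extracted image files.
# Returns one string of recognised text per input file.
ocr_images <- function(paths, lang = "eng") {
  stopifnot(all(file.exists(paths)))
  engine <- tesseract::tesseract(lang)
  tesseract::ocr(paths, engine = engine)
}
```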

yindinda commented 5 years ago

Hi, my name is Dinda and I am a researcher. I regularly need to go through a lot of PDF files, and I am looking for code that will extract images, schemes, and tables from them. If you already have such code in either R or Python, please share it. Thank you.
Regards, Dinda

dcaud commented 2 years ago

Here's a mostly-R painful hack at parsing PDFs for images. It only works sometimes, but it would be great if someone could improve this.

library(stringr)
library(hexView)

.PROBLEM <- function(type, aMessage) {

  # builds e.g. "error in PDF_extractImages(): file not PDF format."
  newMessage <- paste0(type,
                       " in ",
                       as.list(sys.call(-1))[[1]],
                       "(): ",
                       aMessage,
                       ".")
  if(type == "error") stop(newMessage, call. = FALSE)
  message(newMessage)

}

isPDF <- function(aFileName,
                  verbose = TRUE) {

  fileContents <- suppressWarnings(try(
    readLines(aFileName, n = 5, ok = TRUE, warn = FALSE),
    silent = TRUE))

  # error catch when downloading PDFs online
  if(inherits(fileContents, "try-error")) {
    if(verbose == TRUE) {
      message(paste0("Failed validation, with error: ", fileContents[1]))
    } else {
      if(verbose ==  "quiet") return(FALSE)
      # when verbose = FALSE, secret message intended for PDF_downloader success
      message(" poor internet connection,", appendLF = FALSE)
    }
    return(FALSE)
  }

  # accept any PDF version header (%PDF-1.x as well as %PDF-2.0)
  return(any(!is.na(str_extract(fileContents, "%PDF-[0-9]"))))
}

#' Attempts to extract all images from a PDF
#'
#' Tries to extract images within a PDF file.  Currently does not support
#' decoding of images in CCITT compression formats. However, will still save
#' these images to file; as a record of the number of images detected in the PDF.
#'
#' @param file The file name and location of a PDF file.  Prompts
#'    for file name if none is explicitly called.
#'
#' @return A vector of file names saved as images.
#'
#' @export PDF_extractImages

PDF_extractImages <- function(file = file.choose()) {

  # check if file is a PDF
  if(isPDF(file, verbose = "quiet") != TRUE) {
    .PROBLEM("error",
                     "file not PDF format")
  }

  # read file in HEX also ASCII
  rawFile <- hexView::readRaw(file, human = "char")

  # test if read file characters is same as file size
  if (length(hexView::blockValue(rawFile)) != file.info(file)$size) {
    .PROBLEM("error",
                     "possible size reading error of PDF")
  }

  # extract images embedded as PDF objects
  createdFiles_bin <- scanPDFobjects(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_bin), " ")

  # extract images embedded in XML
  createdFiles_XML <- scanPDFXML(rawFile, file)
  #if(quiet != TRUE) message(paste0(createdFiles_XML), " ")

  theSavedFileNames <- c(createdFiles_bin, createdFiles_XML)

  #print(round(7/3) + 7 %% 3)
  #if(ignore != TRUE) {
  #
  #  par(mfrow = c(2,3), las = 1)
  #  for(i in 1:6) {
  #    figure_display(theSavedFileNames[i])
  #    mtext(theSavedFileNames[i], col = "red", cex = 1.2)
  #  }
  #}

  return(theSavedFileNames)
}

scanPDFobjects <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split string by PDF objects and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "endobj")), "endobj", sep="")

  # identify and screen candidate objects with images
  candidateObjects <- c(which(str_extract(theObjects, "XObject/Width") == "XObject/Width"),
                        which(str_extract(theObjects, "Image") == "Image"))

  removeObjects <- c(which(str_extract(theObjects, "PDF/Text") == "PDF/Text"),
                     which(str_extract(theObjects, "PDF /Text") == "PDF /Text"))

  candidateObjects <- unique(candidateObjects[! candidateObjects %in% removeObjects])

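  if(length(candidateObjects) == 0) {
    # report the miss, but return no file names so the result stays a
    # vector of saved files only
    message("No PDF image objects detected.")
    return(character(0))
  }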

  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_bin_", 1:length(candidateObjects), ".jpg", sep="")

  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFobjectToImageFile(y[x],
                                                               theObjects,
                                                               file,
                                                               z[x]),
                        y = candidateObjects, z = fileNames)

  return(theNewFiles)
}

PDFobjectToImageFile <- function (objectLocation,
                                  theObjects,
                                  theFile,
                                  imageFileName) {

  # parse object by stream & endstream
  parsedImageObject <-  unlist(strsplit(theObjects[objectLocation], "stream"))

  # extract key char locations of image in PDF with trailingChars as a correction
  # for "stream" being followed by 2 return characters
  trailingChars <- "  "
  startImageLocation <- nchar(paste(parsedImageObject[1],
                                    "stream", trailingChars, sep = ""))
  endImageLocation <- startImageLocation +
    nchar(substr(parsedImageObject[2],
                 1,
                 nchar(parsedImageObject[2]) - nchar("end")))
  # seq_len() gives an empty index (offset 0) when the image is the first object
  PDFLocation <- nchar(paste(theObjects[seq_len(objectLocation - 1)], collapse = ''))

  # extract binary of image from PDF
  PDFImageBlock <- hexView::readRaw(theFile,
                                    offset = PDFLocation + startImageLocation,
                                    nbytes = endImageLocation, machine = "binary")

  # sometimes the leading 0xff byte of the JPEG start-of-image marker (ff d8) is
  # missing from the extracted block; restoring it helps for JPEGs at least
  if((PDFImageBlock$fileRaw[1] == "d8") && (PDFImageBlock$fileRaw[2] == "ff"))
    PDFImageBlock$fileRaw <- c(as.raw(0xff), PDFImageBlock$fileRaw)

  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(PDFImageBlock$fileRaw, detectedImageFile)
  close(detectedImageFile)

  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}

scanPDFXML <- function (rawFile, file) {

  # collapse ASCII to a single string
  theStringFile <- paste(hexView::blockValue(rawFile), collapse = '')

  # split by XML tags with images and keep delimiter
  theObjects <- paste(unlist(strsplit(theStringFile, "xmpGImg:image>")),
                      "xmpGImg:image>", sep="")

  # identify objects with images
  candidateObjects <- which(str_extract(theObjects, "</xmpGImg:image>") == "</xmpGImg:image>")

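  if(length(candidateObjects) == 0) {
    # report the miss, but return no file names so the result stays a
    # vector of saved files only
    message("No XML image objects detected.")
    return(character(0))
  }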

  # generate file names for candidate images
  fileNames <- paste(rep(tools::file_path_sans_ext(file), length(candidateObjects)),
                     "_XML_", 1:length(candidateObjects), ".jpg", sep="")

  # extract and save all image binaries found in PDF
  theNewFiles <- sapply(1:length(candidateObjects),
                        function(x, y, z) PDFXMLToImageFile(y[x],
                                                            theObjects,
                                                            file,
                                                            z[x]),
                        y = candidateObjects, z = fileNames)

  return(theNewFiles)
}

PDFXMLToImageFile <- function (objectLocation,
                               theObjects,
                               theFile,
                               imageFileName) {

  # parse encoded XML image and clean
  parsedImage <- unlist(strsplit(theObjects[objectLocation], "</xmpGImg:image>"))
  parsedImage <- gsub("&#xA;", "", parsedImage[1])

  # decode image from base64 to raw bytes
  decodedImage <- RCurl::base64Decode(parsedImage, "raw")

  # save binary of image to new file
  detectedImageFile <- file(imageFileName, "wb")
  writeBin(decodedImage, detectedImageFile)
  close(detectedImageFile)

  # TO DO RETURN INFO ABOUT SUCCESSFUL FILE SAVE
  return(imageFileName)
}

PDF_extractImages()
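
A possibly simpler variant of the same idea, for JPEGs only: rather than splitting the file text on `endobj` and `stream`, scan the raw bytes for the JPEG start-of-image (`ff d8 ff`) and end-of-image (`ff d9`) markers and write out each range. This is generic file carving, not what the code above does. It only finds images stored as plain DCT (JPEG) streams, and an embedded thumbnail can end a range early. `find_jpeg_ranges` and `carve_jpegs` are hypothetical names; the sketch uses base R only.

```r
# Hypothetical helper: locate JPEG byte ranges in a raw vector.
# Each start is the position of "ff d8 ff"; each end is the second byte of
# the first "ff d9" that follows it.
find_jpeg_ranges <- function(bytes) {
  n <- length(bytes)
  empty <- data.frame(start = integer(0), end = integer(0))
  if (n < 4) return(empty)
  i <- seq_len(n - 2)
  starts <- which(bytes[i] == as.raw(0xff) &
                  bytes[i + 1] == as.raw(0xd8) &
                  bytes[i + 2] == as.raw(0xff))
  j <- seq_len(n - 1)
  ends <- which(bytes[j] == as.raw(0xff) & bytes[j + 1] == as.raw(0xd9)) + 1L
  end_for <- vapply(starts, function(s) {
    e <- ends[ends > s + 2L]
    if (length(e)) e[1] else NA_integer_
  }, integer(1))
  keep <- !is.na(end_for)
  data.frame(start = starts[keep], end = end_for[keep])
}

# Hypothetical helper: write every carved range to its own .jpg file.
carve_jpegs <- function(pdf, outdir = tempdir()) {
  bytes <- readBin(pdf, what = "raw", n = file.size(pdf))
  ranges <- find_jpeg_ranges(bytes)
  out <- file.path(outdir, sprintf("%s_carved_%d.jpg",
                                   tools::file_path_sans_ext(basename(pdf)),
                                   seq_len(nrow(ranges))))
  for (k in seq_len(nrow(ranges))) {
    writeBin(bytes[ranges$start[k]:ranges$end[k]], out[k])
  }
  out
}
```

Reading with `readBin` keeps everything as bytes, so the offsets do not depend on how binary data is rendered as characters.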

maelle commented 2 years ago

Just saw @sckott star his package https://github.com/sckott/pdfimager

sckott commented 2 years ago

if anyone wants to contribute over there please do, or do your own thing :)

hansthompson commented 2 years ago

Thank you all. I have not tried it yet, but pdfimager looks like what I was doing (without my kludge). If this means closing this issue, I would be fine with that.