ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

Page orientation #28

Open anuraag94 opened 6 years ago

anuraag94 commented 6 years ago

First, I just want to say that this is a fantastic package and has been extremely helpful, thank you.

I'm writing a parser to extract data from unstructured PDFs, and sometimes the pages are rotated 90 degrees. I'm aware that the MediaBox stores properties like page width and page height, and with a few exceptions, I can infer the page orientation from those values.

My question is whether the MediaBox can be accessed using the pdftools package, or whether you know of any other way I can do this from within my R program. Any solution would be much appreciated!

trevorld commented 1 week ago
# Classify each page as "landscape" or "portrait" from the page size reported by pdftools.
pdf_orientation <- function(input) {
    df <- pdftools::pdf_pagesize(input)
    ifelse(df$height < df$width, "landscape", "portrait")
}
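# Re-write a PDF with Ghostscript's pdfwrite device and return the output path (invisibly).
pdf_gs <- function(input, output = NULL, ..., args = character(0L)) {
    input <- normalizePath(input, mustWork = TRUE)
    if (!length(output))
        output <- sub("\\.pdf$", "_output.pdf", input)
    output <- normalizePath(output, mustWork = FALSE)
    cmd <- tools::find_gs_cmd()
    if (!nzchar(cmd))
        stop("Ghostscript executable not found")
    args <- c("-dBATCH",
              "-dNOPAUSE",
              "-sDEVICE=pdfwrite",
              "-sAutoRotatePages=None",  # keep pdfwrite from auto-rotating pages
              paste0("-sOutputFile=", shQuote(output)),
              args,                      # extra Ghostscript arguments from the caller
              shQuote(input))
    # Capture Ghostscript's console output so it is not printed
    out <- system2(cmd, args, stdout = TRUE)
    status <- attr(out, "status")
    if (!is.null(status) && status != 0)
        stop("Ghostscript exited with status ", status)
    invisible(output)
}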
# `input` is the path to a PDF file
input |> pdf_gs() |> pdf_orientation()