hrbrmstr / docxtractr

:scissors: Extract Tables from Microsoft Word Documents with R

doc-file #5

Closed brry closed 8 years ago

brry commented 8 years ago

I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with ".rels") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").

Any idea how to read the old ".doc" format? (I hope it's OK to post this as an issue. Just delete it if not^^)

hrbrmstr commented 8 years ago

If you have LibreOffice (not OpenOffice) installed, you can do something like (this is an OS X command-line):

/Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to docx:"MS Word 2007 XML" filename.doc --headless 

which will convert .doc files to .docx. I believe Windows requires single dashes (-) vs double dashes (--) for the cmd line param options.

It'd be somewhat straightforward for me to write a function to identify whether libreoffice is installed on a given system (win/mac/linux) and then perform this conversion if a .doc is detected (it's going to be a while before I get to that tho).
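
As a rough illustration of that idea, the conversion step could be wrapped in R along these lines. This is only a sketch under assumptions, not the package's code: `build_soffice_args()` and `convert_doc_sketch()` are hypothetical names, the `--outdir` flag is an addition to the command quoted above, and the caller is assumed to already know where `soffice` lives.

```r
# Hypothetical helper (not part of docxtractr): build the argument vector for
# a headless .doc -> .docx conversion. The arguments are shell-quoted because
# system2() pastes them into a single command line.
build_soffice_args <- function(doc_path, out_dir = dirname(doc_path)) {
  c(
    "--headless",
    "--convert-to", shQuote("docx:MS Word 2007 XML"),
    "--outdir", shQuote(out_dir),
    shQuote(doc_path)
  )
}

# Hypothetical wrapper: run soffice and return the path where the converted
# file is expected to appear (same base name, .docx extension).
convert_doc_sketch <- function(soffice, doc_path, out_dir = tempdir()) {
  stopifnot(file.exists(soffice), file.exists(doc_path))
  status <- system2(soffice, build_soffice_args(doc_path, out_dir))
  if (status != 0) {
    stop("soffice exited with status ", status, call. = FALSE)
  }
  file.path(
    out_dir,
    paste0(tools::file_path_sans_ext(basename(doc_path)), ".docx")
  )
}
```

On macOS a call might look like `convert_doc_sketch("/Applications/LibreOffice.app/Contents/MacOS/soffice", "filename.doc")`. Whether the double-dash flags need adjusting on Windows is left open here, as in the comment above.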

This may also work with OpenOffice but the last time I tried the soffice command in headless mode with OpenOffice it failed miserably.

hrbrmstr commented 8 years ago

NOTES

probable Linux locations

probable macOS locations

probable Windows locations

should work on Linux/macOS

should work on Windows

boksic1986 commented 8 years ago

Can it change the table of a docx file?

hrbrmstr commented 8 years ago

@boksic1986 If you're asking if the package can modify the contents of a table in a Microsoft Word document, it cannot. If you desire this functionality, please file a new issue with some specifics on what you are looking for.

ChrisMuir commented 6 years ago

I know this is an old issue, but I had a work need to use this package for both .docx and .doc files, so I've started making revisions in a fork to make it work for both file types. Figured I'd share here, I'd be happy to open a PR if you'd like (or you can use any pieces/parts if you'd like). See my latest commit for all edits.

I'm using the LibreOffice software to convert .doc to .docx, as suggested in this thread. I'm working on Windows so currently the edits are only suited to work on Windows (and it's on the user to figure out their file path of soffice.exe and register it using set_libreoffice_path()). I'd love to expand the functionality though, to work on Mac and Linux. I'm testing using some local files and two urls, everything is working great on my machine:

set_libreoffice_path("C:\\Program Files\\LibreOffice\\program\\soffice.exe")

paths <- c(
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx", 
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)

for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
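
For readers wondering what registering the path involves, a helper in the spirit of `set_libreoffice_path()` might look roughly like the sketch below. This is an assumption about the general shape, not the fork's actual implementation: `set_lo_path_sketch()` is a hypothetical name, and the option name `path_to_libreoffice` is taken from the `getOption()` call that appears later in this thread.

```r
# Hypothetical stand-in for set_libreoffice_path(): validate the path and
# store it in an R option so later calls can look it up.
set_lo_path_sketch <- function(path) {
  path <- path.expand(path)  # resolve a leading "~" if present
  if (!file.exists(path)) {
    stop("No file found at: ", path, call. = FALSE)
  }
  options(path_to_libreoffice = path)
  invisible(path)
}
```

After a successful call, `getOption("path_to_libreoffice")` returns the stored path for the rest of the R session.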

If I make any more progress I can update here.

Also, just want to say this package is AWESOME! About a year ago I needed to get data from both .docx and .doc files and resorted to using Python and the win32com module to extract all content as a string, then piecing the data tables back together... it was kind of a nightmare. So glad I found this, thank you for building it!

hrbrmstr commented 6 years ago

wow (like, srsly: wow!) Definitely add yourself as an aut+ctb into the DESCRIPTION and shoot a PR over. I can poke at cross-platform bits (I have all three OSes but only rarely fire up Windows and always have libreoffice installed for forensic-tool-purposes) and also any CRAN issues that might be there due to a dep on libreoffice. This def needs to get on CRAN as I think a lot of folks are feeling similar pain.

This is great! #ty!

ChrisMuir commented 6 years ago

Sure thing! Just opened a PR.

I plan on working on this some more, I'll let you know if I make more progress. If there's anything in particular you'd like help with on this pkg let me know, I'm happy to help!

ChrisMuir commented 6 years ago

I worked on this some more, I have it up and running on my Mac for both .docx and .doc files (see my commit ecbf2a3). I'm basically just splitting the guts of convert_doc_to_docx() into two functions, convert_win() and convert_osx(), and then using Sys.info() to determine if the os is Windows or not.
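
The OS split described above could be sketched as follows. The two converter functions here are placeholder stubs with hypothetical names and bodies (the real `convert_win()` and `convert_osx()` live in the commit and are not reproduced); only the `Sys.info()` dispatch is the point.

```r
# Placeholder stubs standing in for the real per-OS converters.
convert_win_stub <- function(doc_path) paste("windows conversion of", doc_path)
convert_osx_stub <- function(doc_path) paste("unix-style conversion of", doc_path)

# Pick a converter based on the operating system name. `sysname` is exposed as
# an argument so the branch can be exercised without switching machines.
convert_doc_to_docx_sketch <- function(doc_path,
                                       sysname = Sys.info()[["sysname"]]) {
  if (identical(sysname, "Windows")) {
    convert_win_stub(doc_path)
  } else {
    convert_osx_stub(doc_path)
  }
}
```

Using `[["sysname"]]` rather than `["sysname"]` drops the element name, which keeps the `identical()` comparison against a plain string straightforward.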

I don't have much experience with calling command line tools within an R package, so I'm not sure if my implementation choices are the best.

Again, let me know if you want a PR for these edits 😄

# Same test on a Mac
docxtractr::set_libreoffice_path("/Applications/LibreOffice.app/Contents/MacOS/soffice")

paths <- c(
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx", 
  "http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)

for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10

ChrisMuir commented 6 years ago

I've been thinking about how to "automatically" determine the path to soffice. I've looked around for similar set ups in other packages, but every example I can find involves utilizing software that has its own PATH variable, so it's easy to just use Sys.which() to get the software file path.
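
For comparison, the `Sys.which()` route mentioned above might look like the sketch below when the binary does happen to be on the PATH. `lo_find_on_path()` is a hypothetical name, and the candidate names `soffice` and `libreoffice` are assumptions about how the launcher is commonly called; a lookup like this could be tried first, before any search of fixed locations.

```r
# Hypothetical helper: return the full path of the first candidate executable
# found on the PATH, or NULL if none of them is found.
lo_find_on_path <- function(candidates = c("soffice", "libreoffice")) {
  hits <- Sys.which(candidates)   # "" marks a name that was not found
  hits <- hits[nzchar(hits)]
  if (length(hits) == 0) {
    return(NULL)
  }
  unname(hits[[1]])
}
```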

We could do a manual search (example below), but that doesn't seem ideal. Is something like this what you had in mind?

# Check the R option "path_to_libreoffice". If it's NULL, call lo_find(), which
# will try to determine the local path to the LibreOffice file "soffice". If
# lo_find() is successful, the path to "soffice" is registered via
# set_libreoffice_path(); otherwise an error is thrown.
lo_assert <- function() {
  lo_path <- getOption("path_to_libreoffice")

  if (is.null(lo_path)) {
    lo_path <- lo_find()
    set_libreoffice_path(lo_path)
  }
}

# Returns the local path to LibreOffice file "soffice". Search is performed by 
# looking in the known file locations for the current OS. If OS is not Linux, 
# OSX, or Windows, an error is thrown. If path to "soffice" is not found, an 
# error is thrown.
lo_find <- function() {
  user_os <- Sys.info()["sysname"]
  if (!user_os %in% names(lo_paths_to_check)) {
    stop(lo_path_missing, call. = FALSE)
  }

  lo_path <- NULL
  for (path in lo_paths_to_check[[user_os]]) {
    if (file.exists(path)) {
      lo_path <- path
      break
    }
  }

  if (is.null(lo_path)) {
    stop(lo_path_missing, call. = FALSE)
  }

  lo_path
}

# List obj containing known locations of LibreOffice file "soffice".
lo_paths_to_check <- list(
  "Linux" = c("/usr/bin/soffice",
              "/usr/local/bin/soffice"),
  "Darwin" = c("/Applications/LibreOffice.app/Contents/MacOS/soffice",
               "~/Applications/LibreOffice.app/Contents/MacOS/soffice"),
  "Windows" = c("C:\\Program Files\\LibreOffice\\program\\soffice.exe",
                "C:\\progra~1\\libreo~1\\program\\soffice.exe")
)

# Error message thrown if LibreOffice file "soffice" cannot be found.
lo_path_missing <- paste(
  "LibreOffice software required to read '.doc' files.",
  "Cannot determine file path to LibreOffice.",
  "To download LibreOffice, visit: https://www.libreoffice.org/ \n",
  "If you've already downloaded the software, use function",
  "'set_libreoffice_path()' to point R to your local 'soffice.exe' file"
)

And then lo_assert() could be inserted near the top of read_docx(), like so:

## <snip at line 25>
# Check to see if input is a .doc file
is_input_doc <- is_doc(path)

# If input is a .doc file, create a temp .doc file
if (is_input_doc) {
  lo_assert()
  tmpf_doc <- tempfile(tmpdir = tmpd, fileext = ".doc")
  tmpf_docx <- gsub("\\.doc$", ".docx", tmpf_doc)
} else {
  tmpf_doc <- NULL
  tmpf_docx <- NULL
}
## <continue with function>
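
The snippet above relies on `is_doc()`, which is not shown in this thread. One simple possibility is an extension check, sketched below under the hypothetical name `is_doc_sketch()`; the fork's actual test may differ.

```r
# Hypothetical extension-based check: TRUE when the path or URL ends in ".doc"
# (case-insensitive), FALSE for ".docx" and everything else.
is_doc_sketch <- function(path) {
  grepl("\\.doc$", path, ignore.case = TRUE)
}
```

For example, `is_doc_sketch("report.DOC")` is `TRUE` and `is_doc_sketch("report.docx")` is `FALSE`. A check like this trusts the file name, so a mislabeled file would slip through; inspecting the file's leading bytes would be a stricter alternative.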

bedantaguru commented 5 years ago

I was thinking of an alternative way of supporting doc-files. I opened https://github.com/hrbrmstr/docxtractr/issues/23 for it.