Closed brry closed 8 years ago
If you have LibreOffice (not OpenOffice) installed, you can do something like (this is an OS X command-line):
/Applications/LibreOffice.app/Contents/MacOS/soffice --convert-to docx:"MS Word 2007 XML" filename.doc --headless
which will convert .doc files to .docx. I believe Windows requires single dashes (-) vs double dashes (--) for the cmd line param options.
It'd be somewhat straightforward for me to write a function to identify whether libreoffice is installed on a given system (win/mac/linux) and then perform this conversion if a .doc is detected (it's going to be a while before I get to that tho).
This may also work with OpenOffice but the last time I tried the soffice command in headless mode with OpenOffice it failed miserably.
/usr/bin/soffice
/usr/local/bin/soffice
/Applications/LibreOffice.app/Contents/MacOS/soffice
~/Applications/LibreOffice.app/Contents/MacOS/soffice
C:\Program Files\LibreOffice #.#\program\soffice.exe
C:\progra~1\libreo~1\program\soffice.exe
soffice --convert-to docx:"MS Word 2007 XML" --headless --outdir (somedir) filename.doc
soffice -convert-to docx:"MS Word 2007 XML" -headless -outdir (somedir) filename.doc
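The two command forms above could be wrapped from R roughly as follows. This is only a sketch: soffice_args() and convert_doc() are hypothetical names, not part of docxtractr, the single-dash variant simply mirrors the belief stated earlier in this thread (not verified), and the output file name is assumed to be the input's base name with a .docx extension.

```r
# Hypothetical helper: build the argument vector for the soffice calls above.
# single_dash = TRUE reproduces the single-dash form (unverified assumption
# from this thread about Windows).
soffice_args <- function(input, outdir, single_dash = FALSE) {
  dash <- if (single_dash) "-" else "--"
  c(
    paste0(dash, "convert-to"), "docx:MS Word 2007 XML",
    paste0(dash, "headless"),
    paste0(dash, "outdir"), outdir,
    input
  )
}

# Hypothetical wrapper: run the conversion and return the expected output path.
convert_doc <- function(soffice, input, outdir = tempdir(), single_dash = FALSE) {
  stopifnot(file.exists(soffice), file.exists(input))
  args <- soffice_args(input, outdir, single_dash)
  # system2() pastes the arguments into one command line, so quote each piece.
  status <- system2(soffice, shQuote(args))
  if (status != 0) {
    stop("soffice exited with status ", status, call. = FALSE)
  }
  # Assumed output name: same base name, .docx extension, inside outdir.
  file.path(outdir, sub("\\.doc$", ".docx", basename(input), ignore.case = TRUE))
}
```

Keeping the argument building separate from the system2() call makes it possible to inspect the command without having LibreOffice installed.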
Can it change the table of a docx file?
@boksic1986 If you're asking if the package can modify the contents of a table in a Microsoft Word document, it cannot. If you desire this functionality, please file a new issue with some specifics on what you are looking for.
I know this is an old issue, but I had a work need to use this package for both .docx and .doc files, so I've started making revisions in a fork to make it work for both file types. Figured I'd share here, I'd be happy to open a PR if you'd like (or you can use any pieces/parts if you'd like). See my latest commit for all edits.
I'm using the LibreOffice software to convert .doc to .docx, as suggested in this thread. I'm working on Windows so currently the edits are only suited to work on Windows (and it's on the user to figure out their file path of soffice.exe and register it using set_libreoffice_path()). I'd love to expand the functionality though, to work on Mac and Linux. I'm testing using some local files and two urls, everything is working great on my machine:
set_libreoffice_path("C:\\Program Files\\LibreOffice\\program\\soffice.exe")
paths <- c(
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx",
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)
for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
If I make any more progress I can update here.
Also, just want to say this package is AWESOME! About a year ago I had a need to get data from both .docx and .doc files, I resorted to using Python and the win32com module to extract all content as a string, and then piece the data tables back together.....it was kind of a nightmare. So glad I found this, thank you for building it!
wow (like, srsly: wow!) Definitely add yourself as an aut+ctb into the DESCRIPTION and shoot a PR over. I can poke at cross-platform bits (I have all three OSes but only rarely fire up Windows and always have libreoffice installed for forensic-tool-purposes) and also any CRAN issues that might be there due to a dep on libreoffice. This def needs to get on CRAN as I think a lot of folks are feeling similar pain.
This is great! #ty!
Sure thing! Just opened a PR.
I plan on working on this some more, I'll let you know if I make more progress. If there's anything in particular you'd like help with on this pkg let me know, I'm happy to help!
I worked on this some more, I have it up and running on my Mac for both .docx and .doc files (see my commit ecbf2a3). I'm basically just splitting the guts of convert_doc_to_docx() into two functions, convert_win() and convert_osx(), and then using Sys.info() to determine if the os is Windows or not.
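The dispatch described here might look something like the sketch below. The two converter bodies are placeholder stand-ins (the real ones live in the fork and are not shown in this thread), and convert_dispatch_sketch() is a hypothetical name; only the Sys.info() check is the point.

```r
# Placeholder stand-ins for the fork's converters (not the real implementations).
convert_win <- function(...) "windows branch"
convert_osx <- function(...) "non-windows branch"

# Hypothetical dispatcher: pick a converter based on the OS name.
# `sysname` is a parameter only so the branch can be exercised on any machine.
convert_dispatch_sketch <- function(..., sysname = Sys.info()[["sysname"]]) {
  if (identical(sysname, "Windows")) {
    convert_win(...)
  } else {
    convert_osx(...)
  }
}
```

Using [["sysname"]] rather than ["sysname"] drops the element name, so identical() compares against a plain string.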
I don't have much experience with calling command line tools within an R package, so I'm not sure if my implementation choices are the best.
Again, let me know if you want a PR for these edits 😄
# Same test on a Mac
docxtractr::set_libreoffice_path("/Applications/LibreOffice.app/Contents/MacOS/soffice")
paths <- c(
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1709151519250301478.docx",
"http://www.sxscjg.gov.cn/module/download/downfile.jsp?classid=0&filename=1708211847068712082.doc"
)
for (path in paths) {
  tbl <- docxtractr::read_docx(path)
  df <- docxtractr::docx_extract_tbl(tbl, 1)
  print(dim(df))
  Sys.sleep(5)
}
#> [1] 69 10
#> [1] 15 10
I've been thinking about how to "automatically" determine the path to soffice. I've looked around for similar set ups in other packages, but every example I can find involves utilizing software that has its own PATH variable, so it's easy to just use Sys.which() to get the software file path.
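For reference, the Sys.which() lookup mentioned here is roughly the following when the executable is on the PATH. find_on_path() is a hypothetical helper name; it returns NULL when nothing is found so that a caller could fall back to some other strategy.

```r
# Hypothetical helper: look up an executable on the PATH.
# Sys.which() typically returns "" when the executable is not found.
find_on_path <- function(exe = "soffice") {
  hit <- unname(Sys.which(exe))
  if (!is.na(hit) && nzchar(hit)) hit else NULL
}
```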
We could do a manual search (example below), but that doesn't seem ideal. Is something like this what you had in mind?
# Check the option "path_to_libreoffice" (an R option, read with getOption()).
# If it's NULL, call lo_find(), which will try to determine the local path to
# the LibreOffice file "soffice". If lo_find() is successful, the path is
# registered via set_libreoffice_path(); otherwise an error is thrown.
lo_assert <- function() {
  lo_path <- getOption("path_to_libreoffice")
  if (is.null(lo_path)) {
    lo_path <- lo_find()
    set_libreoffice_path(lo_path)
  }
}
# Returns the local path to LibreOffice file "soffice". Search is performed by
# looking in the known file locations for the current OS. If OS is not Linux,
# OSX, or Windows, an error is thrown. If path to "soffice" is not found, an
# error is thrown.
lo_find <- function() {
  user_os <- Sys.info()["sysname"]
  if (!user_os %in% names(lo_paths_to_check)) {
    stop(lo_path_missing, call. = FALSE)
  }
  lo_path <- NULL
  for (path in lo_paths_to_check[[user_os]]) {
    if (file.exists(path)) {
      lo_path <- path
      break
    }
  }
  if (is.null(lo_path)) {
    stop(lo_path_missing, call. = FALSE)
  }
  lo_path
}
# List obj containing known locations of LibreOffice file "soffice".
lo_paths_to_check <- list(
  "Linux" = c("/usr/bin/soffice",
              "/usr/local/bin/soffice"),
  "Darwin" = c("/Applications/LibreOffice.app/Contents/MacOS/soffice",
               "~/Applications/LibreOffice.app/Contents/MacOS/soffice"),
  "Windows" = c("C:\\Program Files\\LibreOffice\\program\\soffice.exe",
                "C:\\progra~1\\libreo~1\\program\\soffice.exe")
)
# Error message thrown if LibreOffice file "soffice" cannot be found.
lo_path_missing <- paste(
  "LibreOffice software required to read '.doc' files.",
  "Cannot determine file path to LibreOffice.",
  "To download LibreOffice, visit: https://www.libreoffice.org/ \n",
  "If you've already downloaded the software, use function",
  "'set_libreoffice_path()' to point R to your local 'soffice.exe' file"
)
And then lo_assert() could be inserted near the top of read_docx(), like so:
## <snip at line 25>
# Check to see if input is a .doc file
is_input_doc <- is_doc(path)
# If input is a .doc file, create a temp .doc file
if (is_input_doc) {
  lo_assert()
  tmpf_doc <- tempfile(tmpdir = tmpd, fileext = ".doc")
  tmpf_docx <- gsub("\\.doc$", ".docx", tmpf_doc)
} else {
  tmpf_doc <- NULL
  tmpf_docx <- NULL
}
## <continue with function>
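The snippet above relies on is_doc(), which isn't shown in this thread. One plausible way to write such a check, purely as an assumption and not necessarily what the fork does, is an extension test (named is_doc_sketch() here to make clear it is hypothetical):

```r
# Hypothetical stand-in for is_doc(): TRUE when the path ends in ".doc".
# This only looks at the extension, so a URL with trailing query parameters
# after the file name would not be recognised.
is_doc_sketch <- function(path) {
  grepl("\\.doc$", path, ignore.case = TRUE)
}
```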
I was thinking of an alternative way of supporting doc-files. I opened https://github.com/hrbrmstr/docxtractr/issues/23 for it.
I have several .doc files that each (unzipped) only contain "[Content_Types].xml" and the folders "_rels" (with ".rls") and "theme" (with "theme/theme1.xml", "theme/themeManager.xml" and "theme/_rels/themeManager.xml.rels").
Any idea how to read the old ".doc" format? (I hope it's OK to post this as an issue. Just delete it if not^^)
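One way to check which container a file really uses, regardless of its extension, is to look at its first bytes: a .docx is a ZIP archive (starting with the bytes 50 4B 03 04, "PK"), while a legacy binary .doc is an OLE2 compound file (starting with D0 CF 11 E0 A1 B1 1A E1). A file that unzips, as described above, is ZIP-based rather than legacy binary. The helper below is a hypothetical sketch, not part of docxtractr.

```r
# Hypothetical helper: classify a file by its leading signature bytes.
sniff_word_format <- function(path) {
  sig <- readBin(path, what = "raw", n = 8L)
  zip_sig <- as.raw(c(0x50, 0x4B, 0x03, 0x04))
  ole_sig <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
  if (length(sig) >= 4L && identical(sig[1:4], zip_sig)) {
    "zip"      # ZIP container, e.g. .docx
  } else if (length(sig) >= 8L && identical(sig, ole_sig)) {
    "ole"      # OLE2 compound file, e.g. legacy binary .doc
  } else {
    "unknown"
  }
}
```

Calling sniff_word_format() on one of the problem files would at least show whether LibreOffice conversion or a ZIP-based reader is the more promising route.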