tidyverse / readxl

Read excel files (.xls and .xlsx) into R 🖇
https://readxl.tidyverse.org

Feature Request redux - password protected files #688

Closed nfultz closed 2 years ago

nfultz commented 2 years ago

In #84 there was an initial discussion around password-protected Excel files, which was closed at the time because the Microsoft implementation required cpp11 and (at that time) readxl did not. I believe that has changed.

I actually need this feature for a current client, and could be willing to put in some time and effort towards preparing a branch. At least it sounds less painful than going through a large folder, unlocking files one by one and doing File | Save As. Then again, I'm not super familiar with this codebase or that one.

Thanks much.

jennybc commented 2 years ago

Nitpick (but also kinda an important terminology point): the Microsoft implementation referred to in #84 requires C++11 not cpp11 (which is an R package).

And yes readxl now does explicitly declare that it uses the C++11 standard.

The compoundfilereader project doesn't really look like a "live", maintained project to me and, therefore, seems like a questionable foundation to build upon.

So I'm going to close this again.

I would, of course, be happy if we could read password-protected files. And wish you godspeed if you decide to work on it. Come back if you get something that works! Maybe the landscape has changed and there is some newer, maintained implementation that would help us. But I don't think this is enough of a credible near-term goal to keep an open issue for it.

nfultz commented 2 years ago

Fair enough. tl;dr: found a viable workaround.

Since I spent an hour or so investigating, I'll leave some notes here, in case someone else hits this issue in the future, or future me wants to revisit.

tmp$unzip test.xlsx 
Archive:  test.xlsx
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of test.xlsx or
        test.xlsx.zip, and cannot find test.xlsx.ZIP, period.
tmp$~/projects/compoundfilereader/out/cfb info test.xlsx
file version: 3.62
difat sector: 0
directory sector: 0
fat sector: 3
mini fat sector: 1
tmp$~/projects/compoundfilereader/out/cfb list test.xlsx
EncryptionInfo
[DataSpaces]
    DataSpaceMap
    Version
    [DataSpaceInfo]
        StrongEncryptionDataSpace
    [TransformInfo]
        [StrongEncryptionTransform]
            Primary
EncryptedPackage
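
The transcript above is consistent with how password-protected .xlsx files are stored: instead of a plain zip archive, the workbook is wrapped in a compound file (CFB/OLE) container holding the EncryptionInfo and EncryptedPackage streams. A rough way to spot such files in a folder before handing them to readxl is to check the leading magic bytes. This is only a sketch: `sniff_container` is a hypothetical helper name, and a "cfb" result is a hint rather than proof of encryption, since legacy .xls files use the same container.

```python
# Magic bytes at the start of a compound file (CFB/OLE) container.
CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of an ordinary zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(path):
    """Return 'cfb', 'zip', or 'unknown' based on the first bytes of the file.

    Hypothetical helper: an unencrypted .xlsx is normally a zip archive,
    while a password-protected one is wrapped in a CFB container.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(CFB_MAGIC):
        return "cfb"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A file named `*.xlsx` that comes back as "cfb" is a likely candidate for decryption before reading.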

Anyway, if anyone has a folder of several hundred password-protected files to deal with, you can run msoffcrypto-tool using system() or reticulate, and then load the decrypted version into R just fine. Don't forget to clean up the decrypted copies if the data is sensitive. This does require Python, but at least it doesn't require Excel or Java, or rewriting my existing code that used readxl. :shrug:
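
For anyone wanting a starting point, here is a minimal sketch of the batch step on the Python side, assuming msoffcrypto-tool is installed (`pip install msoffcrypto-tool`) and that every file shares one password. The function names and the folder layout are hypothetical, not anything from readxl or from the notes above; only `OfficeFile`, `load_key`, and `decrypt` come from the msoffcrypto package.

```python
from pathlib import Path


def plan_outputs(src_dir, out_dir):
    """Pair each .xlsx in src_dir with a same-named path in out_dir (hypothetical helper)."""
    return [(p, Path(out_dir) / p.name) for p in sorted(Path(src_dir).glob("*.xlsx"))]


def decrypt_workbook(src, dst, password):
    """Write a decrypted copy of src to dst using msoffcrypto-tool."""
    # Imported lazily so plan_outputs() works even without the package installed.
    import msoffcrypto  # third-party package: msoffcrypto-tool

    with open(src, "rb") as fin, open(dst, "wb") as fout:
        office_file = msoffcrypto.OfficeFile(fin)
        office_file.load_key(password=password)
        office_file.decrypt(fout)


def decrypt_folder(src_dir, out_dir, password):
    """Decrypt every .xlsx in src_dir into out_dir, assuming one shared password."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for src, dst in plan_outputs(src_dir, out_dir):
        decrypt_workbook(src, dst, password)
```

The decrypted copies in `out_dir` can then be read with readxl as usual. Writing them to a temporary directory and deleting it afterwards is one way to handle the cleanup for sensitive data.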

jennybc commented 2 years ago

Thanks for leaving some notes!