tidyverse / readxl

Read excel files (.xls and .xlsx) into R 🖇
https://readxl.tidyverse.org
Other
729 stars 194 forks source link

Error when a sheet doesn't have a file in /xl/worksheets #408

Closed nacnudus closed 6 years ago

nacnudus commented 6 years ago

One of the Enron files has a sheet, Module1, that doesn't have a corresponding file in /xl/worksheets. This causes an error when importing the sheet, e.g. when importing every sheet of every file in a directory.

daren_farmer6529egmnom-Jan.xlsx

readxl::excel_sheets("daren_farmer__6529__egmnom-Jan.xlsx")
#> [1] "EGM 60"       "DELV SUMMARY" "EGM 201"      "EGM 202"     
#> [5] "RHODIA"       "Module1"

readxl::read_excel("daren_farmer__6529__egmnom-Jan.xlsx", "Module1")
#> Error in read_fun(path = path, sheet = sheet, limits = limits, shim = shim, : `` not found

The problem disappears (for me) when resaving the file with Excel 2010, because Excel creates a file /xl/worksheets/sheet6.xml. So this is presumably rare.

It is named by excel_sheets() because it is named in /xl/workbook.xml, with an empty r:id. So one way to handle it would be not to include a sheet in excel_sheets() if the r:id is empty. Or maybe a graceful failure would be preferable -- I don't really have an opinion.

<workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" mc:Ignorable="x15">
<fileVersion appName="xl" lastEdited="6" lowestEdited="6" rupBuild="14420"/>
<workbookPr codeName="ThisWorkbook" defaultThemeVersion="153222"/>
<mc:AlternateContent xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006">
<mc:Choice Requires="x15">
<x15ac:absPath xmlns:x15ac="http://schemas.microsoft.com/office/spreadsheetml/2010/11/ac" url="C:\Users\Felienne\Enron\EnronSpreadsheets\"/>
</mc:Choice>
</mc:AlternateContent>
<bookViews>
<workbookView xWindow="240" yWindow="90" windowWidth="9135" windowHeight="4965" tabRatio="721" activeTab="2"/>
</bookViews>
<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>
<definedNames>
<definedName name="_xlnm.Print_Area" localSheetId="2">'EGM 201'!$A$1:$J$75</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="3">'EGM 202'!$A$1:$D$53</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="0">'EGM 60'!$A$1:$D$34</definedName>
<definedName name="_xlnm.Print_Area" localSheetId="4">RHODIA!$A$1:$D$26</definedName>
</definedNames>
<calcPr calcId="152511" iterate="1" iterateCount="1"/>
</workbook>

I don't think the state="veryHidden" property is anything to do with it, at least not with modern versions of Excel. It doesn't prevent there being an /xl/worksheets/sheet6.xml file. The property controls the visibility of the sheet in the Excel UI, and veryHidden means it is invisible in both the normal view and the VBA window, and can only be accessed via VBA code (or by unzipping the file).

jennybc commented 6 years ago

How do you think such an xlsx file comes to be 🤔? What I really mean is, do you think this happens in the wild or is something deeply peculiar to the history of that file? If it happens in the wild, I'm surprised we've never had a report before but who knows.

nacnudus commented 6 years ago

I was wondering the same thing, because Enron predates the xlsx format. The file came from https://gitlab.com/rsheets/enron_corpus -- do you remember whether they were converted from xls to xls first? Anyway, I'd say it's deeply peculiar and I'm not bothered by it, just thought it worth noting.

jennybc commented 6 years ago

I'm 99% sure @richfitz did not do the xls to xlsx conversion, i.e. that was done by Felienne Hermanns group or earlier, when the corpus was assembled.

jennybc commented 6 years ago

Does this one sheet remain our sole example of this? If so, I am inclined to not act on it.

jennybc commented 6 years ago

Something I did seems to have fixed this. Probably eeeebf8171540a7cd14b373d20b08efbac7e3cd2.

🎉

jennybc commented 6 years ago

Oh nevermind. I still can't read that exact sheet, although I can read others. I think, until I see more examples of this, I am not going to worry about it.

jennybc commented 6 years ago

I did some excelgesis on it. Here's the sheets node in workbook.xml:

<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>

So this is a matter of: how to handle "veryHidden" sheets?

Update: I now see you @nacnudus said most of this above yourself.

jennybc commented 6 years ago

I think having a stance on hidden sheets is important. But I doubt this spreadsheet is a good example to base such work on. I suspect that this hit some weird edge case in whatever tool was used to create the Enron corpus, which we suspect involved some xls to xlsx conversion.

If I resave this file, the "Module1" sheet remains listed in workbook.xml, remains veryHidden, gains a proper r:id attribute, and gains its own XML file below xl/worksheets/ (but with empty sheetData node). As @nacnudus saw as well.

Back to this position: unless we have ongoing reports of this, I will let this be. I could change the sheet parsing to only acknowledge a sheet if the r:id attribute has more than zero characters but I think it's not worth complicating the code for this. It seems reasonable to use tryCatch() when doing something ambitious, like reading every putative worksheet in the Enron corpus.