Closed nacnudus closed 6 years ago
How do you think such an xlsx file comes to be 🤔? What I really mean is, do you think this happens in the wild or is something deeply peculiar to the history of that file? If it happens in the wild, I'm surprised we've never had a report before but who knows.
I was wondering the same thing, because Enron predates the xlsx format. The file came from https://gitlab.com/rsheets/enron_corpus -- do you remember whether they were converted from xls to xlsx first? Anyway, I'd say it's deeply peculiar and I'm not bothered by it, just thought it worth noting.
I'm 99% sure @richfitz did not do the xls to xlsx conversion, i.e. that was done by Felienne Hermans' group or earlier, when the corpus was assembled.
Does this one sheet remain our sole example of this? If so, I am inclined to not act on it.
Something I did seems to have fixed this. Probably eeeebf8171540a7cd14b373d20b08efbac7e3cd2.
🎉
Oh, never mind. I still can't read that exact sheet, although I can read others. I think, until I see more examples of this, I am not going to worry about it.
I did some excelgesis on it. Here's the sheets node in `workbook.xml`:
<sheets>
<sheet name="EGM 60" sheetId="20" r:id="rId1"/>
<sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
<sheet name="EGM 201" sheetId="22" r:id="rId3"/>
<sheet name="EGM 202" sheetId="23" r:id="rId4"/>
<sheet name="RHODIA" sheetId="24" r:id="rId5"/>
<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>
So this is a matter of: how to handle "veryHidden" sheets?
Update: I now see you @nacnudus said most of this above yourself.
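For anyone who wants to repeat the inspection without unzipping by hand, here is a minimal sketch in Python (standard library only, so not readxl's own code). It parses the `<sheets>` node quoted above and reports each sheet's name, state, and `r:id`. The namespace declarations are added on the assumption that the real node inherits them from the enclosing `<workbook>` element; `describe_sheets` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces used by SpreadsheetML workbooks.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# The <sheets> node from workbook.xml, with namespace declarations added
# (assumption: in the real file they sit on the parent <workbook> element).
SHEETS_XML = f"""
<sheets xmlns="{MAIN_NS}" xmlns:r="{REL_NS}">
  <sheet name="EGM 60" sheetId="20" r:id="rId1"/>
  <sheet name="DELV SUMMARY" sheetId="21" r:id="rId2"/>
  <sheet name="EGM 201" sheetId="22" r:id="rId3"/>
  <sheet name="EGM 202" sheetId="23" r:id="rId4"/>
  <sheet name="RHODIA" sheetId="24" r:id="rId5"/>
  <sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>
</sheets>
""".strip()


def describe_sheets(xml_text):
    """Return one dict per <sheet> element: name, state, relationship id."""
    root = ET.fromstring(xml_text)
    return [
        {
            "name": sheet.get("name"),
            # A missing state attribute means the sheet is visible.
            "state": sheet.get("state", "visible"),
            "rid": sheet.get(f"{{{REL_NS}}}id", ""),
        }
        for sheet in root.iter(f"{{{MAIN_NS}}}sheet")
    ]


sheets = describe_sheets(SHEETS_XML)
for s in sheets:
    print(s)
```

In this node, "Module1" is the only entry that is both `veryHidden` and has an empty `r:id`, so the two properties cannot be separated from this one example.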
I think having a stance on hidden sheets is important. But I doubt this spreadsheet is a good example to base such work on. I suspect that this hit some weird edge case in whatever tool was used to create the Enron corpus, which we suspect involved some xls to xlsx conversion.
If I resave this file, the "Module1" sheet remains listed in `workbook.xml`, remains `veryHidden`, gains a proper `r:id` attribute, and gains its own XML file below `xl/worksheets/` (but with an empty `sheetData` node), as @nacnudus saw as well.
Back to this position: unless we have ongoing reports of this, I will let this be. I could change the sheet parsing to only acknowledge a sheet if the `r:id` attribute has more than zero characters, but I think it's not worth complicating the code for this. It seems reasonable to use `tryCatch()` when doing something ambitious, like reading every putative worksheet in the Enron corpus.
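The defensive pattern suggested here is not specific to R. As a rough sketch, this is the same idea in Python, where `try`/`except` plays the role of `tryCatch()`. Both `read_all_sheets` and `fake_reader` are hypothetical names for this illustration; `fake_reader` only stands in for a real sheet reader and simulates the failure on "Module1".

```python
def read_all_sheets(sheet_names, read_sheet):
    """Try every sheet and collect failures instead of stopping at the first one.

    `read_sheet` is any callable that takes a sheet name and returns its data.
    """
    results, failures = {}, {}
    for name in sheet_names:
        try:
            results[name] = read_sheet(name)
        except Exception as err:  # deliberately broad: this is a bulk import
            failures[name] = repr(err)
    return results, failures


def fake_reader(name):
    """Stand-in reader (assumption): fails only for the sheet with no XML file."""
    if name == "Module1":
        raise KeyError("no worksheet file for 'Module1'")
    return f"<data for {name}>"


results, failures = read_all_sheets(["EGM 60", "RHODIA", "Module1"], fake_reader)
print(sorted(results))
print(failures)
```

Keeping the failures in their own dictionary means a bulk run over the corpus finishes, and the odd sheets can be reviewed afterwards.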
One of the Enron files has a sheet, `Module1`, that doesn't have a corresponding file in `/xl/worksheets`. This causes an error when importing the sheet, e.g. when importing every sheet of every file in a directory.
daren_farmer6529egmnom-Jan.xlsx
The problem disappears (for me) when resaving the file with Excel 2010, because Excel creates a file `/xl/worksheets/sheet6.xml`. So this is presumably rare.
It is named by `excel_sheets()` because it is named in `/xl/workbook.xml`, with an empty `r:id`. So one way to handle it would be not to include a sheet in `excel_sheets()` if the `r:id` is empty. Or maybe a graceful failure would be preferable -- I don't really have an opinion.
I don't think the `state="veryHidden"` property is anything to do with it, at least not with modern versions of Excel. It doesn't prevent there being an `/xl/worksheets/sheet6.xml` file. The property controls the visibility of the sheet in the Excel UI, and `veryHidden` means it is invisible in both the normal view and the VBA window, and can only be accessed via VBA code (or by unzipping the file).
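To check other files for the same condition, something like the following Python sketch could work (standard library only; this is not how readxl does it). It lists the sheets named in `xl/workbook.xml` whose `r:id` is empty or points at a file that is absent from the archive. The function name `sheets_without_parts` and the tiny in-memory workbook are assumptions made for this example, not the Enron file itself.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def sheets_without_parts(xlsx):
    """Names of sheets listed in xl/workbook.xml that have no worksheet file."""
    with zipfile.ZipFile(xlsx) as zf:
        members = set(zf.namelist())
        workbook = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
    # Map each relationship id to the archive path it targets.
    targets = {
        rel.get("Id"): posixpath.normpath(
            posixpath.join("xl", rel.get("Target"))
        ).lstrip("/")
        for rel in rels.iter(f"{{{PKG_REL_NS}}}Relationship")
    }
    missing = []
    for sheet in workbook.iter(f"{{{MAIN_NS}}}sheet"):
        rid = sheet.get(f"{{{REL_NS}}}id", "")
        # An empty r:id, an unknown id, or a missing target all count.
        if not rid or targets.get(rid) not in members:
            missing.append(sheet.get("name"))
    return missing


# Build a minimal stand-in workbook in memory: one real sheet, one orphan.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(
        "xl/workbook.xml",
        f'<workbook xmlns="{MAIN_NS}" xmlns:r="{REL_NS}"><sheets>'
        '<sheet name="RHODIA" sheetId="24" r:id="rId1"/>'
        '<sheet name="Module1" sheetId="19" state="veryHidden" r:id=""/>'
        "</sheets></workbook>",
    )
    zf.writestr(
        "xl/_rels/workbook.xml.rels",
        f'<Relationships xmlns="{PKG_REL_NS}">'
        f'<Relationship Id="rId1" Type="{REL_NS}/worksheet" '
        'Target="worksheets/sheet1.xml"/></Relationships>',
    )
    zf.writestr(
        "xl/worksheets/sheet1.xml",
        f'<worksheet xmlns="{MAIN_NS}"><sheetData/></worksheet>',
    )
buf.seek(0)

print(sheets_without_parts(buf))
```

Passing a path to a real `.xlsx` file instead of `buf` should work the same way, since `zipfile.ZipFile` accepts either a file name or a file-like object.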