troldal / OpenXLSX

A C++ library for reading, writing, creating and modifying Microsoft Excel® (.xlsx) files.
BSD 3-Clause "New" or "Revised" License
1.28k stars 306 forks source link

Spreadsheets with workbook name other than /xl/workbook.xml in _rels/.rels cannot be opened #254

Open janhec opened 1 week ago

janhec commented 1 week ago

workbook.xml is hardcoded in xldocument.cpp, but is actually variable. I do not encounter this problem when (re)saving a file in excel (or libreoffice calc), the workbooks in question get outputted by a program that I need in my (customers) workflow. Opening and saving in excel or libreoffice will cure this, but this is not a welcome addition to the (automated) workflow. In these cases, the name will be /xl/workbook22.xml, which is not repeated in [Content_Types].xml (in contrast to workbook.xml in the ordinary case). Some (copilot) prodding got me: Variability in Workbook Paths: You mentioned that sometimes the workbook path is workbook22.xml. The specific name (e.g., workbook22.xml) can vary based on the workbook’s history, edits, and other factors. Excel assigns unique names to different versions of the workbook, especially when you make changes or save multiple copies. Why Different Names?: The variation in workbook names allows Excel to manage different versions, track changes, and handle concurrent editing (e.g., when collaborating with others). Each time you save the workbook, Excel may increment the number or use a different identifier to avoid overwriting existing files. The last line seems a dumb addition, but otherwise it sounds appropriate to me, even if I did not find a trigger for this behavior. The initial bug is that workbook.xml will (obviously) not be found in m_data so an exception is thrown. I tried feeding m_data with data (including xmlid) from _rels.rels, which will make the open succeed apparently, but then the sheets will not be found (older version of openxlsx). Another gotcha in this case is that shared strings has x:si instead of si which gets flagged in the may 25 revision. So I am not confident to find a solution myself and hope for any comments.

janhec commented 1 week ago

workbook22.xml contains:

...views...sheets... The workbook id in .rels is not echoed in the sheets, e.g. sheetId="1" r:id="R4b78b06d10804515" vs .rels: Target="/xl/workbook22.xml" Id="R8892f0194c3849b0". So it seems a lot of additional code may be needed to cover this case and perhaps the "SpreadsheetLight" moniker will help.
troldal commented 1 week ago

I would say that this not a bug in OpenXLSX, but a bug in whatever produced the workbook you are trying to open. In ECMA-376 (the official standard for the Office Open XML formats) part 1, section 12.2 stipulates that the workbook data must be physically located in the /workbook.xml file.

image

So a workbook file named workbook22.xml is clearly non-standard

janhec commented 1 week ago

Thank you for the info. I admit not reading the standard. Too lazy and/or busy with other things.

There remains a nag on my mind that the standard may not exclude older standards, so we should adhere to /xl/workbook.xml when creating new (and preferably, saving old) workbooks, but still be able to open this case.

Reasons: appName="SpreadsheetLight" in /xl/workbook22.xml and the fact that there is a library for this (https://github.com/ARLM-Keller/SpreadsheetLight, presumably relevant, did not check further). Then (again) both excel and libreoffice will open these files; after an excel save, the (new) standard is applied. From a worksheet the problematic xmlns (x:): xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main" (I could not access this and similar through a browser.)

Looking at modern standard spec (in workbook.xml), I see e.g.: xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" mc:Ignorable="x15 xr xr6 xr10 xr2" xmlns:x15="http://schemas.microsoft.com/office/spreadsheetml/2010/11/main" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr6="http://schemas.microsoft.com/office/spreadsheetml/2016/revision6" xmlns:xr10="http://schemas.microsoft.com/office/spreadsheetml/2016/revision10" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2"> which seems to include a lot of versions/revisions.

I have solved my problem in a rather hacky way by changing XL_Document and XLXmlData (cpp and hpp), took me a few hours. After making workbook.xml (and workbook.xml.rels) variable (namewise) and getting the name from .rels, putting it in m_data: When loading (and finding nonstandard workbook*.xml), I remove the problematic xmlns (x:) and record the distinct names by xml source; when saving, I put them back (hoping for no conflicts of any kind).

I have not investigated whether this is more generally adequate and useful. I doubt you want to incorporate such mods, but if you wish to look at them, let me know.

No offense if you close this.

aral-matrix commented 1 week ago

@janhec : Out of curiosity: can you summarize the exact relation of all XML files and fields that are different with respect to your workbook22.xml?

I am not entirely sure how your _rels/.rels file is looking - is the workbook22.xml referenced like so?

<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="xl/workbook22.xml"/>

The important bit I am interested in here is whether or not the exact type string is reflected here, so that the document format is at least logically consistent and the workbook file name could be looked up by checking the _rels/.rels file for the relationship entry with that type.

janhec commented 1 week ago

The .rels file looks like this: `<?xml version="1.0" encoding="utf-8"?>

`

workbook22.xml goes like this: `<?xml version="1.0" encoding="utf-8"?>

... more sheets ... ` The workbook filename can be looked up. A bit of code I added in XLDocument.cpp, open() about line 460 std::string wbpath; bool wbfound = false; for (auto rel : m_docRelationships.relationships()) { if (rel.type() == XLRelationshipType::Workbook) { wbpath = rel.target(); auto xmlid = rel.id(); // trying both id's: as they say, identical //xmlid = m_docRelationships.relationshipByTarget(wbpath).id(); if (wbpath[0] == '/') wbpath.erase(0, 1); m_data.emplace_back(/* parentDoc */ this, /* xmlPath */ wbpath, /* xmlID */ xmlid, /* xmlType */ XLContentType::Workbook); wbfound = true; break; } } if (!wbfound) wbpath = "xl/workbook.xml"; needsxcheck = wbpath != "xl/workbook.xml"; // consider check appName="SpreadsheetLight" in xl/workbook*.xml const std::string wkbrels = "xl/_rels/" + wbpath.substr(3) + ".rels"; m_data.emplace_back(this, wkbrels); //, "", XLContentType::Workbook); m_wbkRelationships = XLRelationships(getXmlData(wkbrels));
aral-matrix commented 1 week ago

Hmm... I am a bit worried about the worksheets r:id="R4b78b06d10804515" in workbook22.xml - OpenXLSX has a bunch of code that makes assumptions about how the r:id values are of the format "rId###" where ### is strictly numerical.

Could you provide a full example of one of your xlsx files with a workbook22.xml so that I can have a look at whether it could be supported without major changes?

A file that does load without problems in libreoffice would be great :)

janhec commented 1 week ago

Please treat the appended xlsx file as confidential in the sense of not sharing it widely, e.g. through putting it on a site. Keep private if possible. I tried shortening it without losing the workbook22 thing, but failed to produce a readily loadable result, time is up for a bit.

Btw, I had no apparent difficulty with the r:id's. Didn't notice that a format was enforced, anyhow I may not have gone deep enough to notice. But I could load and save it after my mods.

Thanks for looking into this! Jan

PS shouldn't mention it here, perhaps, but I had trouble appending rows with new rows having to be added to the worksheet. This worked fine with the previous (2024-05-04) lib, but not in the 2024-05-24 version.

XLWorksheet::Row(unint32_t) constructor -> getRowNode(XMLNode, uint32_t), rowNumber > result.attribute("r").as_ullong(), return -> XLQuery -> execQuery (SharedStrings) result empty XLRow(const XMLNode&, const XLSharedStrings&). Subsequent .cells() gave me m_rowNode empty, m_firstCol 1, m_lastCol 0. .cells().begin() then crashes when doing a unique_ptr: Exception thrown: read access violation. std::forward<pugi::xml_node & __ptr64>(...) returned nullptr.

I'll still have to replay this without my mods, but it does not obviously seem connected to the mods. Advice welcome.

On Thu, Jul 11, 2024 at 2:29 AM aral-matrix @.***> wrote:

Hmm... I am a bit worried about the worksheets r:id="R4b78b06d10804515" in workbook22.xml - OpenXLSX has a bunch of code that makes assumptions about how the r:id values are of the format "rId###" where ### is strictly numerical.

Could you provide a full example of one of your xlsx files with a workbook22.xml so that I can have a look at whether it could be supported without major changes?

A file that does load without problems in libreoffice would be great :)

— Reply to this email directly, view it on GitHub https://github.com/troldal/OpenXLSX/issues/254#issuecomment-2221764142, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBHL7JAPMMXXJ4VBACJO3ZLXGX7AVCNFSM6AAAAABKNE2BZCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRRG43DIMJUGI . You are receiving this because you were mentioned.Message ID: @.***>

aral-matrix commented 1 week ago

Please treat the appended xlsx file as confidential in the sense of not sharing it widely

I am not sure if I am missing something w.r.t. how github is used, but I do not see an attachment. You can reach me via my codeberg profile https://codeberg.org/lars_uffmann - email is listed there.

janhec commented 1 week ago

Sorry if I appear blind, but I don't see an email address on your codeberg profile. Perhaps I should register? Can you send me an email through the email address intended, so I can reply and you get the attachment? Thanks, Jan

On Thu, Jul 11, 2024 at 11:03 AM aral-matrix @.***> wrote:

Please treat the appended xlsx file as confidential in the sense of not sharing it widely

I am not sure if I am missing something w.r.t. how github is used, but I do not see an attachment. You can reach me via my codeberg profile https://codeberg.org/lars_uffmann - email is listed there.

— Reply to this email directly, view it on GitHub https://github.com/troldal/OpenXLSX/issues/254#issuecomment-2222404318, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTBHL3GRRCM5URO6HIFNB3ZLZC4NAVCNFSM6AAAAABKNE2BZCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRSGQYDIMZRHA . You are receiving this because you were mentioned.Message ID: @.***>

aral-matrix commented 1 week ago

my apologies, it indeed doesn't show up to non-registered users - I don't see your email either though :D

However, my project files contain my email, e.g. this one: https://codeberg.org/lars_uffmann/fwmqueue/src/branch/master/demo.cpp (line #5) - I just want to avoid linking it directly here with this github account, I hate microsoft with a passion :D