troldal / OpenXLSX

A C++ library for reading, writing, creating and modifying Microsoft Excel® (.xlsx) files.
BSD 3-Clause "New" or "Revised" License

Writing largeish sheets/xlsx-files(30-50MB) seems very slow/unable to complete writing files from std::strings #241

Open og-yona opened 8 months ago

og-yona commented 8 months ago

Hello!

I'm working on a project where I need to handle semi-large CSV and xlsx files, and I tried adding OpenXLSX to the project to handle reading and writing the xlsx files.

Reading the example xlsx file with 5 sheets into std::string storage finishes in 7 seconds, which is very nice and fast.

But when I try to write the same data from std::strings to a new xlsx file, the process gets progressively slower the more data/sheets it has already written. In practice, OpenXLSX was unable to finish writing the data back to xlsx: I waited 1.5 hours and had to kill the process because it was seemingly stuck writing one sheet.

Writing a cell/row at a time makes basically no difference.

My problem might be related to this issue: https://github.com/troldal/OpenXLSX/issues/154

Is there any way to skip the shared-strings checks and just write everything as plain strings? Or does someone have any other tips that might make writing files usable when dealing with larger amounts of random string data?

og-yona commented 8 months ago

Answering for myself, and for future reference in case someone is having the same issue.

Looking around the OpenXLSX source files, I managed to find a sort of fix, at least for my case.

In XLCellValue.cpp I commented out lines 402, 405 and 409:

    // ===== Set the type attribute.
    m_cellNode->attribute("t").set_value("s");

    // ===== Get or create the index in the XLSharedStrings object.
    auto index = (m_cell->m_sharedStrings.stringExists(stringValue) ? m_cell->m_sharedStrings.getStringIndex(stringValue) : m_cell->m_sharedStrings.appendString(stringValue));

    // ===== Set the text of the value node.
    m_cellNode->child("v").text().set(index);

and uncommented lines 412 and 413 instead:

    // m_cellNode->attribute("t").set_value("str");
    // m_cellNode->child("v").text().set(stringValue);

I left the following lines (415-419) commented out, because uncommenting them caused problems:

    // auto s = std::string_view(stringValue);
    // if (s.front() == ' ' || s.back() == ' ') {
    //     if (!m_cellNode->attribute("xml:space")) m_cellNode->append_attribute("xml:space");
    //     m_cellNode->attribute("xml:space").set_value("preserve");
    // }

Saving my earlier example file now finishes in less than 30 seconds, which is around what I was hoping for.

Edit: I accidentally closed the issue; I'm not sure whether my comment/uncomment tweak actually counts as solving it.

aral-matrix commented 3 months ago

Subscribed myself to this issue so I can have a look into it eventually. The shared strings logic in the case that you describe might indeed benefit from handling an ordered set in memory, and only writing the XML file once (from that set) when the document is saved.
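
To make the idea of keeping the shared strings in memory and writing the XML only once more concrete, here is a minimal sketch. It is not OpenXLSX code: `SharedStringPool`, `intern` and `toXml` are hypothetical names invented for illustration, and it uses a hash map plus a vector rather than an ordered set, so that lookups are constant time on average while indices keep their insertion order. It also omits details a real implementation would need, such as `xml:space="preserve"` for strings with leading or trailing spaces.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of OpenXLSX): an in-memory shared-string pool.
class SharedStringPool {
public:
    // Return the index of `text`, appending it on first use.
    // The hash map makes the existence check O(1) on average.
    std::size_t intern(const std::string& text) {
        auto [it, inserted] = m_index.try_emplace(text, m_strings.size());
        if (inserted) m_strings.push_back(text);
        return it->second;
    }

    std::size_t size() const { return m_strings.size(); }

    // Serialize the whole table in one pass, e.g. when the document is saved.
    std::string toXml() const {
        std::ostringstream out;
        out << "<sst xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\""
            << " uniqueCount=\"" << m_strings.size() << "\">";
        for (const auto& s : m_strings) out << "<si><t>" << escape(s) << "</t></si>";
        out << "</sst>";
        return out.str();
    }

private:
    // Escape the characters that are not allowed verbatim in XML text.
    static std::string escape(const std::string& s) {
        std::string result;
        result.reserve(s.size());
        for (char c : s) {
            switch (c) {
                case '&': result += "&amp;"; break;
                case '<': result += "&lt;";  break;
                case '>': result += "&gt;";  break;
                default:  result += c;       break;
            }
        }
        return result;
    }

    std::unordered_map<std::string, std::size_t> m_index;  // string -> index
    std::vector<std::string> m_strings;                    // index -> string
};
```

With a pool like this, each cell write would call `intern(value)` and store the returned index in the cell's `v` node, and `toXml()` would run once at save time instead of the XML being touched for every new string.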