troldal / OpenXLSX

A C++ library for reading, writing, creating and modifying Microsoft Excel® (.xlsx) files.
BSD 3-Clause "New" or "Revised" License
1.37k stars 333 forks source link

String writing using different values for each row is extremely slow #154

Open pkulijing opened 2 years ago

pkulijing commented 2 years ago

Just change the benchmarking code BM_WriteStrings a little bit. Original code:

std::deque<XLCellValue> values(colCount, "OpenXLSX");

for (auto _ : state)    // NOLINT
    for (auto& row : wks.rows(rowCount)) row.values() = values;

Changed code:

for (auto _ : state)    // NOLINT 
    for (auto& row : wks.rows(rowCount)) {
        std::deque<XLCellValue> values(colCount, std::to_string(row.rowNumber()));
      row.values() = values;
    }

The code becomes so slow that the benchmarking is not finished after a few minutes. Similar modification to BM_WriteIntegers causes no problem. In practical, it's rare that different rows share the same value. Thus the library behaves much slower than claimed in the doc when there are a lot of strings in the excel file. In my case, writing a file with ~20000 rows, 3 columns (all string) takes 40 seconds. By attaching the process with gdb I find that it seems to be doing a lot of string comparisons to avoid creating duplicate string nodes in the xml file. With m_stringCache being a std::deque, the performance becomes unacceptable when the number of strings increases. Is it possible to use std::set?

(gdb) bt
#0  0x0000562cb653402e in std::operator==<char> (__rhs="8396", __lhs="82284")
    at /usr/include/c++/9/bits/basic_string.h:6152
#1  OpenXLSX::XLSharedStrings::<lambda(const string&)>::operator() (__closure=<synthetic pointer>, 
    s="8396")
    at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:74
#2  __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> >::operator()<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*> > (__it=..., this=<synthetic pointer>)
    at /usr/include/c++/9/bits/predefined_ops.h:283
#3  std::__find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > > (__pred=..., __last=..., __first=...)
    at /usr/include/c++/9/bits/stl_algo.h:120
#4  std::__find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > > (__pred=..., __last=..., __first=...)
    at /usr/include/c++/9/bits/stl_algo.h:162
#5  std::find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > (__pred=..., __last=..., __first=...)
    at /usr/include/c++/9/bits/stl_algo.h:3970
#6  OpenXLSX::XLSharedStrings::getStringIndex (this=this@entry=0x7ffd521afa98, str="82284")
    at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:74
#7  0x0000562cb65342bd in OpenXLSX::XLSharedStrings::stringExists (this=this@entry=0x7ffd521afa98, 
    str="82284")
    at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:84
#8  0x0000562cb6508882 in OpenXLSX::XLCellValueProxy::setString (this=this@entry=0x7ffd521afaa8, 
    stringValue=0x7ffd521afb00 "82284") at /usr/include/c++/9/bits/char_traits.h:300
og-yona commented 8 months ago

There is a small tweak which can be done to XLCellValue.cpp to skip the shared strings, which helped in my case: https://github.com/troldal/OpenXLSX/issues/241#issuecomment-2000082534