Just change the benchmarking code BM_WriteStrings a little bit. Original code:
std::deque<XLCellValue> values(colCount, "OpenXLSX");
for (auto _ : state) // NOLINT
for (auto& row : wks.rows(rowCount)) row.values() = values;
Changed code:
for (auto _ : state) // NOLINT
for (auto& row : wks.rows(rowCount)) {
std::deque<XLCellValue> values(colCount, std::to_string(row.rowNumber()));
row.values() = values;
}
The code becomes so slow that the benchmarking is not finished after a few minutes. Similar modification to BM_WriteIntegers causes no problem. In practical, it's rare that different rows share the same value. Thus the library behaves much slower than claimed in the doc when there are a lot of strings in the excel file. In my case, writing a file with ~20000 rows, 3 columns (all string) takes 40 seconds.
By attaching the process with gdb I find that it seems to be doing a lot of string comparisons to avoid creating duplicate string nodes in the xml file. With m_stringCache being a std::deque, the performance becomes unacceptable when the number of strings increases. Is it possible to use std::set?
(gdb) bt
#0 0x0000562cb653402e in std::operator==<char> (__rhs="8396", __lhs="82284")
at /usr/include/c++/9/bits/basic_string.h:6152
#1 OpenXLSX::XLSharedStrings::<lambda(const string&)>::operator() (__closure=<synthetic pointer>,
s="8396")
at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:74
#2 __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> >::operator()<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*> > (__it=..., this=<synthetic pointer>)
at /usr/include/c++/9/bits/predefined_ops.h:283
#3 std::__find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > > (__pred=..., __last=..., __first=...)
at /usr/include/c++/9/bits/stl_algo.h:120
#4 std::__find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, __gnu_cxx::__ops::_Iter_pred<OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > > (__pred=..., __last=..., __first=...)
at /usr/include/c++/9/bits/stl_algo.h:162
#5 std::find_if<std::_Deque_iterator<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char>&, std::__cxx11::basic_string<char>*>, OpenXLSX::XLSharedStrings::getStringIndex(const string&) const::<lambda(const string&)> > (__pred=..., __last=..., __first=...)
at /usr/include/c++/9/bits/stl_algo.h:3970
#6 OpenXLSX::XLSharedStrings::getStringIndex (this=this@entry=0x7ffd521afa98, str="82284")
at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:74
#7 0x0000562cb65342bd in OpenXLSX::XLSharedStrings::stringExists (this=this@entry=0x7ffd521afa98,
str="82284")
at /home/lijing17/Developer/lireplay/webserver/third_party/openxlsx/OpenXLSX/sources/XLSharedStrings.cpp:84
#8 0x0000562cb6508882 in OpenXLSX::XLCellValueProxy::setString (this=this@entry=0x7ffd521afaa8,
stringValue=0x7ffd521afb00 "82284") at /usr/include/c++/9/bits/char_traits.h:300
Just change the benchmarking code
BM_WriteStrings
a little bit. Original code:Changed code:
The code becomes so slow that the benchmarking is not finished after a few minutes. Similar modification to
BM_WriteIntegers
causes no problem. In practical, it's rare that different rows share the same value. Thus the library behaves much slower than claimed in the doc when there are a lot of strings in the excel file. In my case, writing a file with ~20000 rows, 3 columns (all string) takes 40 seconds. By attaching the process with gdb I find that it seems to be doing a lot of string comparisons to avoid creating duplicate string nodes in the xml file. Withm_stringCache
being astd::deque
, the performance becomes unacceptable when the number of strings increases. Is it possible to usestd::set
?