tfussell / xlnt

:bar_chart: Cross-platform user-friendly xlsx library for C++11+
Other
1.5k stars 423 forks source link

Exception thrown with duplicate strings #41

Closed adam-nielsen closed 8 years ago

adam-nielsen commented 8 years ago

If a spreadsheet produced by Excel has duplicate strings, xlnt throws an exception when the spreadsheet is opened.

terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 5413) >= this->size() (which is 5412)

It is caused when the code tries to reference a shared string, but somehow the shared string has been lost, so the index given in the XML file is out of range.

I have narrowed down the cause of this to the same string being listed more than once in the shared strings XML file. It seems a mistake to have two identical strings here (defeats the purpose of having shared strings in the first place), however this is how Excel saves the file, so it would be nice if xlnt could work around the bug and read the file anyway.

I have created a test case that demonstrates this problem, which I will add below as a pull request.

tfussell commented 8 years ago

Nice debugging work. I've added a fix to the shared strings serialization code that should take care of this for you. I'm currently refactoring much of the serialization so I don't want to spend too much time on making this more performant (specifically removing duplicates after loading the workbook).

adam-nielsen commented 8 years ago

Perfect, many thanks! Performance isn't a problem for me at the moment so no complaints there.

wei-sketchup commented 8 years ago

From the ISO/IEC 29500 18.4.9 sst(Shared String Table) "A string is unique even if it is a copy of another string, but has different formatting applied at the character level."

Character level formatting will break the string into multiple text runs (multiple "t" child). The current code only picks up the first text run. More over, since it ignores all the character level formatting, unique "si" may still end up the same string value.

tfussell commented 8 years ago

Thanks Wei. I should have checked the standard. I'll be pushing some big updates in the next week so I'll try to implement this more correctly in that release.

wei-sketchup commented 8 years ago

Thanks! Great work!

adam-nielsen commented 8 years ago

FYI, unrelated to @wei-sketchup's issue, I have worked out the cause of the duplicate shared strings in the #42 test case. It turns out that Excel only does string deduplication when you enter them by hand. When you copy and paste a large block of data (e.g. from a .csv file) then Excel does not perform deduplication, I guess for performance reasons. This is why you end up with duplicate shared strings in that case where you really shouldn't as they are truly identical.

tfussell commented 8 years ago

I appreciate the follow-up Adam. I was debating asking you how I could reproduce this issue. I have a lot to go on now!