Closed adam-nielsen closed 8 years ago
Nice debugging work. I've added a fix to the shared strings serialization code that should take care of this for you. I'm currently refactoring much of the serialization so I don't want to spend too much time on making this more performant (specifically removing duplicates after loading the workbook).
Perfect, many thanks! Performance isn't a problem for me at the moment so no complaints there.
From the ISO/IEC 29500 18.4.9 sst(Shared String Table) "A string is unique even if it is a copy of another string, but has different formatting applied at the character level."
Character level formatting will break the string into multiple text runs (multiple "t" child). The current code only picks up the first text run. More over, since it ignores all the character level formatting, unique "si" may still end up the same string value.
Thanks Wei. I should have checked the standard. I'll be pushing some big updates in the next week so I'll try to implement this more correctly in that release.
Thanks! Great work!
FYI, unrelated to @wei-sketchup's issue, I have worked out the cause of the duplicate shared strings in the #42 test case. It turns out that Excel only does string deduplication when you enter them by hand. When you copy and paste a large block of data (e.g. from a .csv file) then Excel does not perform deduplication, I guess for performance reasons. This is why you end up with duplicate shared strings in that case where you really shouldn't as they are truly identical.
I appreciate the follow-up Adam. I was debating asking you how I could reproduce this issue. I have a lot to go on now!
If a spreadsheet produced by Excel has duplicate strings, xlnt throws an exception when the spreadsheet is opened.
It is caused when the code tries to reference a shared string, but somehow the shared string has been lost, so the index given in the XML file is out of range.
I have narrowed down the cause of this to the same string being listed more than once in the shared strings XML file. It seems a mistake to have two identical strings here (defeats the purpose of having shared strings in the first place), however this is how Excel saves the file, so it would be nice if xlnt could work around the bug and read the file anyway.
I have created a test case that demonstrates this problem, which I will add below as a pull request.