roo-rb / roo

Roo provides an interface to spreadsheets of several sorts.
MIT License
2.78k stars 502 forks source link

Fix xlsx parsing of empty cells with no formatting #591

Open agius opened 1 year ago

agius commented 1 year ago

I found a spreadsheet in the wild with a bunch of empty-value cells.

<c r="K58" s="2" t="s">
  <v>65</v>
</c>
<c r="L58">
  <v/>
</c>
<c r="M58" t="s">
  <v>158</v>
</c>

Because there was no format specified for the cell, it had at least one child XML node, and there was no value, the default behavior was to treat the cell as a number type and try to parse the value (which was converted to empty-string). Using the kernel-level method Integer() , this threw an exception. The exception was also surprisingly hard to trace; it didn't have a full stacktrace in the wild and didn't appear in $! in IRB; only got a trace once I got it into RSpec.

Apple's "Numbers" program converts the cell to zero; Google Sheets leaves it blank. I'm not sure what Excel itself does with this kind of XML. This sheet was produced by KendoUI, some angular framework with tables, so it's probably not very common.

This change adds a nil check to the typecast, and returns 0 or 0.0 on empty-string values. I've included a spec and an example file. The file from the wild contains sensitive data, but I reproduced the empty-cell XML by hand-editing an export from Google Sheets.

I'm not 100% sure this is the right behavior, and I'm open to alternatives. Maybe we could check in SheetDoc and return an Empty cell type instead? Or add an option for "throw exception" vs "default to zero?" But it feels like since Numbers and Sheets can open these files, Roo probably shouldn't throw an exception here.

agius commented 1 year ago

cc @patrickkulling ? Not sure who to ask for review on this. Thanks!