pythonicrubyist / creek

Ruby library for parsing large Excel files.
http://rubygems.org/gems/creek
MIT License
386 stars 109 forks

Handle XML namespaces in worksheets #101

Closed. bschmeck closed this 1 year ago

bschmeck commented 2 years ago

We've run into an issue parsing an XLSX file whose worksheet nodes are namespaced (e.g. <x:row>).

~This PR addresses that issue by using the local_name method when looking for row, c, v and t nodes. The name method includes the namespace, e.g. x:row, but local_name will strip the namespace prefix, allowing the existing comparison logic to work.~

This PR addresses that issue by identifying the namespace prefix (if there is one) while SAX parsing the sheet and looking for nodes whose name includes the prefix.
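As a rough illustration of the idea (not the code in this PR), the prefix can be read from the `xmlns:*` declarations on the root element and then prepended to the node names being matched. The helper names `namespace_prefix` and `qualified_name` are hypothetical, and the attributes are assumed to arrive as a plain Hash, which may differ from what the SAX parser actually hands over.

```ruby
# Hypothetical helpers for illustration only; not creek's implementation.

# Returns the prefix bound to the SpreadsheetML namespace, or nil when the
# namespace is the default one (declared with a bare "xmlns") or is absent.
# `attributes` is assumed to be a Hash of the root element's attributes.
def namespace_prefix(attributes,
                     uri: 'http://schemas.openxmlformats.org/spreadsheetml/2006/main')
  attributes.each do |name, value|
    next unless value == uri && name.start_with?('xmlns:')
    return name.delete_prefix('xmlns:')
  end
  nil
end

# Builds the node name to compare against, e.g. "x:row" or plain "row".
def qualified_name(local, prefix)
  prefix ? "#{prefix}:#{local}" : local
end

attrs = { 'xmlns:x' => 'http://schemas.openxmlformats.org/spreadsheetml/2006/main' }
prefix = namespace_prefix(attrs)  # "x"
qualified_name('row', prefix)     # "x:row"
```

With no prefix declared, `qualified_name` falls back to the bare name, so un-namespaced sheets would keep matching `row`, `c`, `v` and `t` as before.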

Additionally, when the shared strings dictionary is built, this PR identifies the namespace prefix (if there is one) and includes the namespace in the CSS query used to parse the dictionary. An alternative approach would be to call remove_namespaces! on the document, but that seems a bit heavy-handed.

bschmeck commented 2 years ago

After thinking about it more, I decided that it makes more sense to use the approach taken for the shared strings dictionary when parsing the sheet's rows as well. Using local_name is akin to calling remove_namespaces!, which runs the risk of parsing nodes that we shouldn't (nodes named row, c, v or t but in a different namespace).

Making the row parsing logic namespace aware seems like the better solution.
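To make the difference concrete, here is a small sketch that compares the two matching strategies on a list of element names. It assumes the prefix `x` is bound to the spreadsheet namespace and `other` to some unrelated one. Both method names are hypothetical, and the string-splitting stand-in only mimics what local_name does.

```ruby
# Illustration only: element names as a parser might report them.

# Mimics local_name by dropping everything up to the first colon.
def strip_prefix(name)
  name.split(':', 2).last
end

# Matches on the local name alone, ignoring which namespace the node is in.
def rows_by_local_name(names)
  names.select { |name| strip_prefix(name) == 'row' }
end

# Matches only names carrying the expected prefix (or no prefix at all
# when the sheet is not namespaced and `prefix` is nil).
def rows_by_qualified_name(names, prefix)
  expected = prefix ? "#{prefix}:row" : 'row'
  names.select { |name| name == expected }
end

names = ['x:row', 'x:c', 'x:v', 'other:row']
rows_by_local_name(names)           # ["x:row", "other:row"]
rows_by_qualified_name(names, 'x')  # ["x:row"]
```

The local-name match picks up `other:row` even though it belongs to a different namespace, which is the over-matching risk described above.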