Test suite - Githubissues

smahood commented 8 years ago

I've been thinking about building a test suite to help validate that things are working for issue #26 (and now it would probably help with issue #28 as well). Have you got any preference as to testing library or example projects that I should follow as far as the test styles?

I'm planning on building some tests anyway to help verify that some of my more critical reports are working over the long term, so if the tests I write don't have an eventual home in excel-templates itself they will still serve a useful purpose for me.

tomfaulhaber commented 8 years ago

A good test suite is something that we are sorely lacking. I'm not too worried about which framework we pick.

I've been thinking about this, too, and here are some of my thoughts:

One thing that is important (and essentially the basis of any good testing we would do here) is a library of tools that let us make assertions about the spreadsheets, presumably going around POI and operating on the ZIP/XML directly (because it will be more robust).

One important assertion would be that workbook is a correct workbook. In general that means that all the linkages are right and there are no orphans and that all formulas point to real things (both on worksheets and on charts). Over time, we can add more checks around structural correctness within the elements.

It's probably worth adding POI to this as a separate check of its own, where we open the worksheet with POI and walk the elements, making sure that POI doesn't explode along the way.

Other assertions would be about the size of a worksheet, the contents of a block of cells, and the value of formulae (both on worksheets and chart references).

Like #28, a POI-independent formula parser would be helpful here, though we certainly can get started without it.

smahood commented 8 years ago

I agree wholeheartedly with those points, and based on the bugs and human errors that I've encountered so far I think that the quality and sources of the workbooks is going to be a critical part of having good and useful tests. The other interesting side of a testing suite is that we may be able to use it as a starting point to tackle #28 by starting with POI based tests and then transitioning them over to direct manipulation and verification of the ZIP/XML as we figure out how to deal with the different components.

As a starting point, here's a way that may work for testing using POI and the current capabilities of the library.

Can I read in an excel file with known values, reproduce the file exactly to a temporary location, and read back and verify those same values? This could be extended so that the base testing file (potentially the same file created in different environments and different versions of Excel and Excel alternatives) can cover multiple scenarios (Cell A1 should = 1, Cell A3 should be a formula with a certain value, etc). In this manner, we could potentially extend that one excel file (or create many for different tests) with one feature at a time to build up the tests over time.

I'm a little uncomfortable with creating a bunch of temp files as part of the tests, but I don't really see a good way around it when using POI.

I would ideally like to build some of the validation into a part of the library that can be used in production - I have some templates where I would like to have the formulas and results of test runs validated before running reports on an ongoing basis to make sure I don't break reports over time.

tomfaulhaber commented 8 years ago

Yeah, basically the right points. I think that we can skip the POI-based stuff because implementing the first step straight from the XML is not hard with (a) the tools clojure gives you and (b) the understanding I've gotten about how Excel works inside from working on this.

As a start, try running this script on your excel file and then poke around the resulting file tree: https://gist.github.com/tomfaulhaber/4f1db38af6e57ac648e9. I think you'll see that the cells on the worksheets are really easy to find. Using zippers and data.xml I think it's pretty easy to write a library that we can build assertions on.

@smahood, are you coming up to ClojureWest this year? If so, we could do a mini-hackathon to get things started.

smahood commented 8 years ago

@tomfaulhaber Not able to come this year, that would have been a great way to get this started. Any good references to start looking into the internals of Excel? I'd love to get something started, either on the testing side or the non-POI rewrite.

tomfaulhaber commented 8 years ago

Hey @smahood, sorry to hear that you won't be in Seattle. We'll try to have fun without you :)

Official documentation

The bible of Office XML is the ECMA 376 which has a home page here and the key document is the ECMA-376, 4th Edition, Office Open XML File Formats — Fundamentals and Markup Language Reference which is literally 5000 pages long. Don't worry, though, I haven't really found much need for that, although it is comforting to know I have it available if I need it. (Just now, I discovered that the complete formula grammar is in there starting on page 2043. That might come in handy.)

Unbundling an existing spreadsheet

If I were you, the first thing I would do is use the script I mentioned above and blow up one of your interesting excel workbooks (and also maybe the one in excel-templates-example). This turns the excel file into a directory tree and prettifies all the XML so that it's pretty easy to navigate. It also creates a .projectile file if you're that way inclined.

Exploring

Start by looking at the layout of the worksheets themselves and see what's going on. You will see rows, cells, values and formulas in there. If you look at a sheet like xl/worksheets/sheet1.xml, you'll see cell definitions like:

<c r="Q5" s="5">
  <f>N5-H5</f>
  <v>0</v>
</c>

The s="5" attribute is a (zero-based) style index. The styles are in xl/styles.xml under the tag cellXfs. In our app, we mostly don't need to worry about the styles except to make sure they are duplicated correctly, but it might be important to refer to them in testing since it's easier to make assertions about.

The other thing to think about is the linkage between files. For us, the most important linkage is the worksheet -> drawing -> chart relationship. Relationships are specified by an ID, like rId1 that are used in an r:id attribute of an object to indicate that the actual object is somewhere else:

<drawing r:id="rId1"/>

The target of the relationship is specified by a separate file. If you are looking at xl/worksheets/sheet1.xml, the relationships will be specified in xl/worksheets/_rels/sheet1.xml.rels which will look something like this:

<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/drawing" Target="../drawings/drawing1.xml"/>
</Relationships>

For the case of charts, you need to be aware that sheets have drawings and drawings have the charts. So you need to follow that linkage two steps (not too difficult once you know what you're looking for).

Existing XML code

As I mention elsewhere, I have long ago started down the path of manipulating the XML directly because POI was either not working right or was just in the way.

The two cases are:

Manipulating the charts themselves to update the data that they point at and replicate series when appropriate. This is the code in charts.clj up until the comment that begins ;;; Code for copying charts when....
All the code to correctly clone a chart requires us to first delete the chart (after getting all the info from the sheet), and then recreate the chart (modified) on the new sheets. To do all of this requires us to handle the linkage described above. There are two parts to this code. In build.clj, from charts-from-file to insert-charts, and in charts.clj from the comment ;;; Code for copying charts when... almost to the end.

These two areas will give you some feeling for how I've dealt with the XML issues so far. I think building a capability to do some simple validation and check assertions based on the file structure will not be too hard, despite the extraordinary length of this comment.

Let me know if and how I can help you along the way.

tomfaulhaber / excel-templates

Test suite #30

Official documentation

Unbundling an existing spreadsheet

Exploring

Existing XML code