qrilka / xlsx

Simple and incomplete Excel file parser/writer
MIT License
130 stars 64 forks source link

Pr/high level features #36

Closed edsko closed 8 years ago

edsko commented 8 years ago

So, here it is, the master PR :) I've bundled them all in this PR, because they kind of build on each other, although each commit is an orthogonal feature. Some comments on each one:

  1. https://github.com/qrilka/xlsx/commit/8e2be9a1c861fe83e56dfb542da082ad74dda53d introduces a high level datatype StyleSheet for creating stylesheets, along with a rendering function

    renderStyleSheet :: StyleSheet -> Styles

    This approach I've taken throughout, for two reasons: I wanted to avoid changing the core xlsx code as much as possible, and moreover I did not want to spend time writing parsers for all these features. So the core datatype still has the raw Styles type which you will get when parsing a document, but you can use the structured types to create these stylesheets.

  2. Although the support added in (1) suffices to create stylesheets, it is still somewhat low-level. The core xlsx Cell type wants a cell style ID, and moreover when creating stylesheets you end up working with these low level style IDs. In https://github.com/qrilka/xlsx/commit/b3aad926e915c3e0da7e9a3af85964e249c43e6b I add a higher level interface in the form of a FormattedCell, which is a cell with explicit formatting attached to it, as well as a col-span and a row-span. Creating spreadsheets that are rich in layout is much easier with this type, which can then be translated with

    formatted :: Map (Int, Int) FormattedCell -> StyleSheet -> Formatted

    which returns a CellMap, a StyleSheet, and a list of cell merges (computed from the cell-span and row-span from the FormattedCells).

  3. https://github.com/qrilka/xlsx/commit/c639e5d1a1138da96d60370e245c582332e1fb2e adds support for in-cell formatting. In addition to a new RichTextRun type to represent rich text, this also required treating shared string tables with a bit more care, as they now contain rich text as well as plain text. This required a few minor changes to the core types. The most important one is that cells now have a fourth case:

    data CellValue = ... | CellRich [RichTextRun]

    Moreover, the writer now uses the new SharedStringTable type rather than Vector Text. NOTE: I did not change the parser at all, so that will still lose all rich text properties when loading Excel files. If you wanted to change that, a first step could be to start using the SharedStringTable type also in the parser; I added

    sstParse :: Cursor -> SharedStringTable

    as a first step, which constructs the SharedStringTable using the existing parseSharedStrings functionality. So this still loses the rich text properties, but it could be a first step. (sstParse is currently unused).

  4. https://github.com/qrilka/xlsx/commit/4541d5f57a7064631dcc17f6d0a31ae50848d864 adds support for SheetViews, which allows you to say stuff like "show the grid on this sheet", "add a split point", etc. This commit just adds support for "raw" sheetviews, much like the core xlsx package has support for raw stylesheets. Unlike the stylesheet, however, the raw sheetview is an XML node; this is important because it sits in the middle of a larger XML document.
  5. https://github.com/qrilka/xlsx/commit/740f3e5e732fb904a22e108c1f96689013e0a960 then adds high-level support for creating SheetViews, along with a rendering function

    renderSheetViews :: [SheetView] -> RawSheetViews
  6. Similarly https://github.com/qrilka/xlsx/commit/ac97f43ae36a9feedcb000031f1708341d4a5b28 adds support for raw page setup (again, and XML node) and finally https://github.com/qrilka/xlsx/commit/3416536d2b9ffae281af887768130a0303275c8e adds high-level support in the form of

    renderPageSetup :: PageSetup -> RawPageSetup

I should also mention that as an implementation issue I added classes

class ToDocument a where
  toDocument :: a -> Document

class ToElement a where
  toElement :: Name -> a -> Element

class ToAttrVal a where
  toAttrVal :: a -> Text

and supporting functionality (see Codec.Xlsx.Writer.Internal). I use these classes throughout to make the translation from Haskell types to Excel's XML types very uniform. I have not modified the existing code at all to make use of these classes, but you certainly could. If you wanted to add parsing functionality for all the higher level types I'd recommend adding classes for the opposite direction, too.

All this stuff has been tested with GHC 7.6 - GHC 7.10 (I think I tried 7.4 but some dependency failed to install, IIRC). The output has been tested with Excel 2007, Excel 2010, Excel Online, as well as OpenOffice/LibreOffice, Google Docs, and Gnumerics. Not all functionality works in all applications, but most does, and everything that I've tested works in Excel proper. Note however that I've tried to be pretty exhaustive in my definitions of the parts of the specification that I did implement, but I did not test each and every combination of all new functionality. Finally, I've documented all high level with the documentation from the spec, as well as references to the spec so that if something isn't quite working, or isn't quite clear, it's easy to look up the relevant part of the specification.

This will close Issues #28, #29, #31, and #34.

qrilka commented 8 years ago

That's huge @edsko , hope I could review it properly over the weekend. As for ToXXX classes - I've seen those, looked interesting but I didn't check the complete details yet sounds like a good addition. How do you test it with all those Excels? Manually? And also - do you have anything else in your sleeve? :) Or I could start preparing 0.2 with it and possible later additions will go into subsequent versions?

edsko commented 8 years ago

Yes, manually, sadly. And no, I have nothing else planned for now. Sleeve is empty :)

qrilka commented 8 years ago

@edsko one more question - did you write all those stylesheet property definitions by hand or used something to automate the process?

edsko commented 8 years ago

By hand (just following the spec).

qrilka commented 8 years ago

@edsko I have reviewed the PR, the only other question is - why do we need those "raw" forms for sheet views and page setup? The structures seem to be comparatively small so I don't see what gain could we have from postponing their parsing/rendering but API requires an extra step. Maybe I miss some other important argument?

edsko commented 8 years ago

The most important reason is that I didn't want to write parsers. By keeping the raw representation in the core datatypes it's clear to the user of the library that we make no attempt to parse things like page setup or sheet views. (For the case of stylesheets, there is also the additional reason that the definition is not completely exhaustive, so it's conceivable that someone might want to bypass the structured stylesheet and define their own -- this is especially relevant when editing spreadsheets previously generated by Excel. Not that we guarantee any kind of roundtrip property as things stand of course.)

Alternatively we can make the structured types part of the core, and just add a comment for now that when you read an Excel stylesheet you will get a dummy value here. I'm fine with either approach. (Or you could actually write the parser, of course.)

qrilka commented 8 years ago

With incomplete and potentially large stylesheets it sounds understandable but sheet views and page setup are much smaller and they are in the same file. What do you think If I'll merge your version, then remove raw sheetviews/pagesetup checking their parsers before publishing new library version on hackage? Thus master will have your version of API with 2 extra "raw" values for some (hopefully short) period of time but that API won't get published. And you will need to update your code when those raw values will be removed. What do you think? Also I think that there is no special need for you to review PR with references updated for the 4th edition so if you are OK with my proposal then I'll merge this PR and fix references in later commit.

edsko commented 8 years ago

Yup, that all sounds perfect.

qrilka commented 8 years ago

BTW @edsko please let me know if you'll notice something bad in parser/removing raw nodes in https://github.com/qrilka/xlsx/commit/fa2dc3f01d33bb6c3e5644c82978503cbb2896f7 or maybe in some previous commits. And just in case - I want to move modules a bit also so all "standalone" data structures will go into Codec/Xlsx/Types/ directory (but will be reexported from Codex.Xlsx.Types)

edsko commented 8 years ago

@qrilka Sorry for the delay, I got moved to a different project and didn't get a chance to look at this. Just updated my code to compile against your master branch and it all looks good.

qrilka commented 8 years ago

thanks a lot for all your efforts @edsko