qrilka / xlsx

Simple and incomplete Excel file parser/writer
MIT License
130 stars 64 forks source link

Replace Double in CellValue by Scientific #177

Open olafklinke opened 4 months ago

olafklinke commented 4 months ago

The XML contains decimal values, which are accurately represented by Scientific but not by Double as in the CellDouble constructor of CellValue. This PR makes CellDouble a pattern synonym and replaces the old constructor by a new constuctor CellDecimal which holds a Scientific value. For motivation, see this issue.

olafklinke commented 4 months ago

Fixed the test suite. Note that the code base may have some CellDouble patterns lurking, which do an implicit Scientific -> Double conversion where it may be unreasonable. Sadly the one-directional pattern synonym can not make the change in representation completely transparent. (That is, Haskell currently does not allow writing a bi-directional pattern that uses a different conversion function each way.) We might want to come up with a good test that exhibits the rounding errors in using the CellDouble pattern. Further I am not an expert at writing prisms, nor lens idiosyncrasies at all, so please feel free to edit my PR so that all required lenses are present. Otherwise I guess this PR would introduce an unnecessary amount of breakage in downstream packages.

qrilka commented 4 months ago

Would you mind checking my comments on the PR? And it would be great to add the test you proposed as well. As for unforseen corner case - let's take a chance and fix problems when we see some.

qrilka commented 4 months ago

@olafklinke it looks like the microlens flag is a bit broken

olafklinke commented 4 months ago

Ah yes, the manually defined _CellDecimal prism conditional on the microlens flag was missing. This is fixed in 6c9ddcb. Further there are two warnings about incomplete pattern matches in src/Codec/Xlsx/Writer/Stream.hs:346 and src/Codec/Xlsx/Writer/Stream.hs:357 but these are actually covered by the CellDouble view pattern. Therefore I added a COMPLETE pragma.

qrilka commented 4 months ago

Thanks for fixing that, could you also address my other comments? BTW it would be neat to squash the commits

olafklinke commented 4 months ago

could you also address my other comments

What is left unaddressed? Changing CacheNumber to have a Scientific argument is not trivial, because in Codec.Xlsx.Writer.Internal we would need a ToAttrVal Scientific instance, but the txtd function that is used for the Double instance relies on RealFloat which Scientific is not. Hence we need a better conversion Scientific -> Text.Builder first. We could use the one provided by scientific but I have no idea whether that conforms to the XML standard. Would that be covered by the test suite?

qrilka commented 4 months ago

This looks like this uncovers the problem with the PR - it doesn't fix the writer as a result saving and reading will be not idempotent :-\ Looks like OOXML spec doesn't specify anything about decimals in cell values :( But for cached fields it's

<xsd:attribute name="v" use="required" type="xsd:double"/>

so I suppose we could use it as a guide here. Does this sound fine with you? Also we need to add a decimal which is not representable with the current xlsx version and which will make writer/read not idempontent in this PR.

Also let's add a bound on scientific

olafklinke commented 4 months ago

There are two idempotency issues here: The immediate Haskell idempotency issue is that Data.Text.Read.rational and Data.Text.Lazy.Builder.scientificBuilder need to be mutually inverse. That is within our power. The second idempotency issue is whether other OOXML-processing software can change the representation of decimal numbers so that it trips up Data.Text.Read.rational, or whether the text representations produced by Data.Text.Lazy.Builder.scientificBuilder adhere to the XML standard. (I think they do, it is quite liberal.) Indeed the standard for xsd:double states that the value range must adhere to IEEE-754 which Scientific can easily surpass. Conversely, Double has values that are not representable in Scientific, e.g. infty = 1/0 :: Double which fromFloatDigits maps to 1.797693134862316e308.

In fact, even the current published version of xlsxdoes not adhere to the standard, since Data.Text.Read.rational fails on the texts "NaN" and "INF" which are legal xsd:double according to the spec.

Summary: If the OOXML standard really forces all decimals in XLSX to be IEEE-754 doubles, then #176 has no basis, as all double precision floating point numbers do have an exact representation in scientific notation, and a slightly improved rational parser would have us covered for the NaN and INF corner cases.

qrilka commented 4 months ago

As I said in the other doc it seems that I was extrapolating too much, cell values are not xsd:double but cached numbers are. Ok, in this situation I'd propose to leave CacheNumber as is and make read . write == id including currently unrepresentable date values.