qrilka / xlsx

Simple and incomplete Excel file parser/writer
MIT License
132 stars 64 forks source link

Correctly parse shared strings in the streaming implementation #178

Closed sergesku closed 5 months ago

sergesku commented 5 months ago

At the moment Codec.Xlsx.Parser.Stream.parseSharedStrings implementation incorrectly parses shared strings values: it creates a separate values for each formatted piece of text. This creates skewed data in the result.

Here is an example of failing test:

textA1 = "Text at A1"
firstClauseB1 = "First clause at B1;"
firstClauseB2 = "First clause at B2;"
secondClauseB1 = "Second clause at B1"
secondClauseB2 = "Second clause at B2"

cellRich :: Text -> Text -> CellValue
cellRich firstClause secondClause = CellRich
  [ RichTextRun
      { _richTextRunProperties = SomeProperties
      , _richTextRunText = firstClause
      }
  , RichTextRun
      { _richTextRunProperties = SomeOtherProperties
      , _richTextRunText = secondClause
      }
  ]

richWorkbook :: Xlsx
richWorkbook = def & atSheet "Sheet1" ?~ toWs
  [ ((RowIndex 1, ColumnIndex 1), cellValue ?~ CellText textA1 $ def)
  , ((RowIndex 2, ColumnIndex 1), cellValue ?~ cellRich firstClauseB1 secondClauseB1 $ def)
  , ((RowIndex 2, ColumnIndex 2), cellValue ?~ cellRich firstClauseB2 secondClauseB2 $ def)
  ]

        Expected:
        fromList
          ["First clause at B1;Second clause at B1",
           "First clause at B2;Second clause at B2", "Text at A1"]

        Difference:
        2,3c2,3
        <   ["First clause at B1;Second clause at B1",
        <    "First clause at B2;Second clause at B2", "Text at A1"]
        ---
        >   ["First clause at B1;", "First clause at B2;",
        >    "Second clause at B1", "Second clause at B2", "Text at A1"]

This PR fixes this behavior. Now we parse a shared string as a single cell value.

P.S. We still create CellValue as CellText, not as CellRich.

qrilka commented 5 months ago

Thanks @sergesku I'll take a look into it soon.

qrilka commented 5 months ago

P.S. We still create CellValue as CellText, not as CellRich.

Do you mean that for the streamed parser? Would you mind adding that as a know TODO? Probably to the function you changed The PR looks good

sergesku commented 5 months ago

Do you mean that for the streamed parser?

Yes. Thus, if we have a cell with a Rich Text value in an a file, at the exit from the stream parser we have a regular CellText value.

qrilka commented 5 months ago

Thanks @sergesku for your contribution I plan to put this on Hackage after #177 will land on master.