jmcnamara / libxlsxwriter

A C library for creating Excel XLSX files.
https://libxlsxwriter.github.io
Other
1.51k stars 333 forks source link

Support for Dynamic Array Functions like FILTER #311

Closed Robert-M-Muench closed 3 years ago

Robert-M-Muench commented 4 years ago

Looks like the new dynamic array functions result in a bit different XML then before. Any chance to add these?

xmllint --format bsh/xl/worksheets/sheet3.xml | rg "filter"
        <f t="array" ref="E2:T326">_xlfn._xlws.FILTER(... <f>
jmcnamara commented 4 years ago

That is already supported. You just need to prefix the formula with _xlfn._xlws.

This is explained, somewhat, in this section of the docs: Formulas added in Excel 2010 and later.

For example:

#include "xlsxwriter.h"

int main() {

    /* Create a new workbook and add a worksheet. */
    lxw_workbook  *workbook  = workbook_new("array_formula.xlsx");
    lxw_worksheet *worksheet = workbook_add_worksheet(workbook, NULL);

    worksheet_write_array_formula(worksheet, 3, 5, 18, 8, "_xlfn._xlws.FILTER(A5:D20,C5:C20=H2,\"\")", NULL);

    /* It can also be written like this: */
    /* worksheet_write_array_formula(worksheet, 3, 5, 18, 8, "{=_xlfn._xlws.FILTER(A5:D20,C5:C20=H2,\"\")}", NULL); */

    workbook_close(workbook);

    return 0;
}

Output:

Screenshot 2020-11-05 at 11 38 55

Robert-M-Muench commented 4 years ago

Yes, I found this out after digging a bit around. Pretty cool. However, there is a little difference, as you can see:

image

In the upper case, you see that FILTER is not an array function. And I use a dynamic changing filter clause, which filters a list of 300+ rows down to the ones you see.

In the lower case, you see the effect written with libxlsxwriter, which uses an array function, which leads to all the "#NV" errors for rows that don't fit the filter criteria.

I took a look at the XML but can't find any difference between both versions. Maybe in the upper case, the cell->type is something different than what libxlsxwriter writes out (ARRAY_FORMULA_CELL):

image

Is there a way to write such an array formula without the {...} characters/type which indicates a (static) array function?

Robert-M-Muench commented 4 years ago

The differences in the XML are for the row entries:

image

Left side is the one which doesn't produce the "#NV" errors.

jmcnamara commented 4 years ago

I'll look into it a bit more and let you know what I find. I have some ideas about what the issue may be but I need to test them out.

jmcnamara commented 4 years ago

I think the issue (or one of the issues) is that the cell contains a reference to a Metadata file that contains additional information about the formula.

For example here is the cell and formula elements from a file similar to the one created in the example above but created in Excel:

      <c r="F4" cm="1">
        <f t="array" ref="F4:I19">_xlfn._xlws.FILTER(A5:D20,C5:C20=H2,"")</f>
        <v>0</v>
      </c>

The difference between this and libxlsxwriter generated file is the cm="1" attribute. Removing this from the Excel file creates a {=FILTER()} style formula when opened in Excel.

The cm reference is to the xl/metadata.xml file.

This, or a secondary issue, may be the reason behind the #NV values. Could you attach the Excel file shown above and the libxlsxwriter equivalent to the GitHub issue and I'll dig into it a bit deeper. Or else email them to jmcnamara@cpan.org.

Robert-M-Muench commented 4 years ago

Example XLSX files send to you.

jmcnamara commented 3 years ago

Thanks for the files.

The bad news is that if you remove the cm=n reference from the Excel generated cell attributes the formula stops working as expected and starts to behave like the libxlsxwriter file. It still works for the original area but the {=FILTER()} is present and if you change the "Range to" value the displayed cell range doesn't change. And if the new range is smaller than the old range the #N/A values return. I'll send you the file for reference.

So unfortunately this won't be fixable without adding the metadata.xlm file, the metadata elements, the cell linkage and the .rels and Content_Types additions. That is quite a bit of work which doesn't have a good return on investment for such a small range of new dynamic array formulas. So, unfortunately, for now, this is probably a won't fix.

From an implementors point of view this is a rather disappointing change in the way formulas are handled in the file format. This is the first set of functions in 13 years that Excel didn't just load and re-evaluate directly from the worksheet.xml file.

Robert-M-Muench commented 3 years ago

The #N/A is ugly but would work in my use-case.

I think the dynamic array functions will become more and more mainstream because they are very powerful. And that MS did implement them in such a complicated way might be a hint, that they just don't want to let others use them...

How about estimating a sponsoring for this feature?

jmcnamara commented 3 years ago

I'll try look into implementing it over the Christmas holidays.

Robert-M-Muench commented 3 years ago

Great! What sources do you use to dig through the XLSX file-format?

jmcnamara commented 3 years ago

What sources do you use to dig through the XLSX file-format?

I mainly just unzip the files and diff them. I use a small Perl program to unzip the xlsx and fix some of the xml/vml files to make them more suitable for diffing: such as sort .rel entries. I then use Araxis Merge for a visual diff because it splits the single line xml data into multiple (virtual) lines which make the comparisons easier. I occasionally use xmllint --format for local inspection. And I sometimes refer to ECMA-376-1 Office Open XML File Formats — Fundamentals and Markup Language Reference specification (like in this case to see what the cm attribute was).

Nothing very fancy or complicated. :-)

jmcnamara commented 3 years ago

Closing this and merging it with #327