vinci1it2000 / formulas

Excel formulas interpreter in Python.
https://formulas.readthedocs.io/
European Union Public License 1.1
342 stars 75 forks source link

Find recursively "pure input" cells #142

Open nopria opened 2 months ago

nopria commented 2 months ago

I'm trying to disclose the computing structure of an excel file in order to reproduce and perform those computation from code. Basically I'd like to have a function able to find all the dependencies of a cell:

Once you know all the "pure input" cells, computing the value of a formula cell is easy with calculate method:

xl_model.calculate(
    inputs={"'[test.xlsx]WS'!M11":1.35, ... },
    outputs=["'[test.xlsx]WS'!M15"] # To define the outputs that you want to calculate.
)

The initilization is easy, formulas does a great job at creating a model and all information seem to be there:

import formulas
xl_model = formulas.ExcelModel().loads("test.xlsx").finish()
xl_model.calculate()
xl_model.to_dict()

When the inputs method of a cell returns None (as far as I understand) the cell does NOT contain a formula (like "pure input" cells), but as soon as I look at a cell containing a formula

c_M15 = xl_model.cells.get("'[test.xlsx]WS'!M15")
c_M15.inputs # OrderedDict([("'[test.xlsx]WS'!M10:N14", ["'[test.xlsx]WS'!M10:N14"])])

or

c_M10 = xl_model.cells.get("'[test.xlsx]WS'!M10")
c_M10.inputs # OrderedDict([("'[test.xlsx]WS'!H10", ["'[test.xlsx]WS'!H10"]), ("'[test.xlsx]WS'!K10", ["'[test.xlsx]WS'!K10"])])

I could not understand if and how I can use formulas methods to:

Thanks in advance for your attention.

dberardo-com commented 2 months ago

i dont know if this could be related to any of those: https://github.com/vinci1it2000/formulas/issues/119 https://github.com/vinci1it2000/formulas/issues/131

nopria commented 2 months ago

The script in #119 seems to go in the right direction, although, as already noted, does not work with ranges. In my case, it correctly returns

>>> get_one_line_formula(xl_model, "'[test.xlsx]WS'!M15")
'SUM((0.22 / 0.6):N14)'

With few changes and an helper function

import re

def expand_range(range_str,xl_model):
    '''
    Convert a string representing an Excel range like "'[test.xlsx]WS'!B3:D8" into a list of strings representing all single Excel cells of the range, filtering out cells unused in Excel document model `xl_model`.
    '''
    match = re.match(r"'([^']+)'\!([A-Z]+)(\d+):([A-Z]+)(\d+)", range_str)
    if match:
        sheet_name, col_start, row_start, col_end, row_end = match.groups()
        col_start_num = ord(col_start) - 65
        col_end_num = ord(col_end) - 65
        num_rows = int(row_end) - int(row_start) + 1
        cell_list = ["'{}'!{}{}".format(sheet_name, chr(65 + col), row)
                        for row in range(int(row_start), int(row_start) + num_rows)
                        for col in range(col_start_num, col_end_num + 1)]
         # cells really used in Excel document are only those in xl_model.dsp.dmap.pred dictionary keys
        return [c for c in cell_list if c in xl_model.dsp.dmap.pred]
    else:
        return None

import functools
import schedula as sh
@functools.lru_cache()
def pure_input_cells(xl_model, node):
    '''
    Knowing the model of an Excel document, find recursively the formula expressed in terms of "pure input" cells of a given formula.
    '''
    rl = expand_range(node,xl_model) # list of cells of a range, or None if argument is not a range
    if rl is not None:
        # even if in Excel parameters are separated by ';', in `formulas` model the separation character is ','
        return ','.join(pure_input_cells(xl_model, rc) for rc in rl if rc in xl_model.dsp.dmap.pred)
    try:
        formula = next(iter(xl_model.dsp.dmap.pred[node]))
    except StopIteration:
        return node
    for k in sorted(xl_model.dsp.dmap.pred[formula], key=lambda x: (len(x), x), reverse=True):
        if not isinstance(k, sh.Token):
            formula = formula.replace(k, pure_input_cells(xl_model, k))
    return formula[1:]

I was able to make it returns

>>> pure_input_cells(xl_model, "'[test.xlsx]WS'!M15")
"SUM(('[test.xlsx]WS'!K10 / '[test.xlsx]WS'!H10),'[test.xlsx]WS'!M11,('[test.xlsx]WS'!H12 * '[test.xlsx]WS'!K12))"

which express the formula of the input cell in terms of "pure input" cells expanding the range.

I hope a similar functionality will be added to the library core.