handsontable / hyperformula

HyperFormula is an open-source headless spreadsheet for business web apps. It comes with over 400 formulas, CRUD operations, undo-redo, clipboard support, and sorting.
https://hyperformula.handsontable.com/
Other
1.98k stars 109 forks source link

35 seconds to build graph for non-trivial workbook #872

Open rackuka opened 2 years ago

rackuka commented 2 years ago

Description

For non-trivial workbook (10+ worksheets, multiple rows and formulas referencing cross worksheets + named expressions) it takes hyperformula about 30-35 seconds to build dependency graph.

Steps to reproduce

  1. Get xlsx workbook (not able to publicly share my sample due to NDA and it is timeconsuming to reproduce it. most value is in math model, not numbers...)
  2. Translate xlsx into dictionary of array of arrays + load all named expressions. Use any library which is able to read xlsx format (XLSX from SheetsJS, ExcelJS, etc)
  3. do buildFromSheets (with default options)
  4. Measure time of buildFromSheets execution.

Demo

Here is profiling result: https://drive.google.com/file/d/1A9hGck4U8BdXs_dJ3_jllLH51qDVDSEr/view?usp=sharing

Your environment

Mac OS X, 2.5 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3

Links

AMBudnik commented 2 years ago

Thank you for reporting the issue @rackuka

If there is something we can do to speed up building the graph, we should try to cut that 35 seconds. My colleague or I will keep you updated about the progress.

Ps. Can I ask you to upload the attached report to any supported GH format? That way we will have it available on Github (without Google being involved).

rackuka commented 2 years ago

vscode-profile-2021-12-08-15-47-25.cpuprofile.zip

AMBudnik commented 2 years ago

Thank you @rackuka

bardek8 commented 2 years ago

@rackuka, indeed, building the graph is the most time-consuming part of processing the data by HyperFormula. We tried to implement it in as effective way as possible, but we are aware this might be the bottleneck for some big, non-trivial cases.

Have you seen the hints mentioned here: https://github.com/handsontable/hyperformula/discussions/871 ? I believe that the first of them - an attempt to rearrange dependencies between the cells of the workbook - would give more immediate effects in your case than optimizing the code. Maybe you can find formulas that operate on big ranges, which is not always necessary? HyperFormula needs to process all the dependencies between the cells and will certainly benefit from the optimizations from the workbook creator, who knows well properties of data and the functions used.

Some example that comes to my mind is that you can try to rearrange columns for VLOOKUP, so that there are no irrelevant cells in-between. For example in VLOOKUP("abcd",B1:U10,8) one may think that there are only 20 cells that should be considered here, but in the current implementation we don't optimize it and we will process all the 200 cells.

Btw, how big (number of rows and cells) is your example?

rackuka commented 2 years ago

Hi.

Rearranging formulas at some point will be out of my influence. Hence I was looking for performant engine which could keep up in a reasonable time (100 ms). Definitely the reducing amount of look ups boosts the performance.

To your question - there are 15 worksheets. The “heaviest” has 250 rows by 15 columns with index match in each of them.

Alexei

On Dec 27, 2021, at 11:28 AM, Bartek Dudek @.***> wrote:

@rackuka https://github.com/rackuka, indeed, building the graph is the most time-consuming part of processing the data by HyperFormula. We tried to implement it in as effective way as possible, but we are aware this might be the bottleneck for some big, non-trivial cases.

Have you seen the hints mentioned here: #871 https://github.com/handsontable/hyperformula/discussions/871 ? I believe that the first of them - an attempt to rearrange dependencies between the cells of the workbook - would give more immediate effects in your case than optimizing the code. Maybe you can find formulas that operate on big ranges, which is not always necessary? HyperFormula needs to process all the dependencies between the cells and will certainly benefit from the optimizations from the workbook creator, who knows well properties of data and the functions used.

Some example that comes to my mind is that you can try to rearrange columns for VLOOKUP, so that there are no irrelevant cells in-between. For example in VLOOKUP("abcd",B1:U10,8) one may think that there are only 20 cells that should be considered here, but in the current implementation we don't optimize it and we will process all the 200 cells.

Btw, how big (number of rows and cells) is your example?

— Reply to this email directly, view it on GitHub https://github.com/handsontable/hyperformula/issues/872#issuecomment-1001429604, or unsubscribe https://github.com/notifications/unsubscribe-auth/APFG6LRD2KV7AIBBJVL7EW3UTAPSRANCNFSM5JWRAOLQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.

sequba commented 2 years ago

When completed, please re-verify https://github.com/handsontable/handsontable/issues/8118.