vallettea / koala

Transpose your Excel calculations into python for better performances and scaling.
GNU General Public License v3.0
145 stars 59 forks source link

Pulling functions from Libreoffice #155

Open aodhan-domhnaill opened 5 years ago

aodhan-domhnaill commented 5 years ago

I noticed that there are a lot of function implementations missing here. Has any thought been put into pulling them from Libreoffice source? Perhaps finding a way to interface with the Libreoffice source as a dependency and strip out the functions we need.

Asking because I was interested in looking, but if someone has already done this, I'd rather not waste time.

aodhan-domhnaill commented 5 years ago

For example CONFIDENCE,

/formula/inc/core_resource.hrc gives a list of all the functions and their opcodes.

You can cross reference the OP code at /sc/source/core/data/funcdesc.cxx and the documentation for the function at /sc/inc/scfuncs.hrc

All those ops are found in /sc/source/core/opencl, for example OpConfidence. And they are generating OpenCL code it seems, so we could probably extract that with a simple execution and have a huge dump of operands.

Brad-eki commented 5 years ago

Hi @aidan-plenert-macdonald, I have been thinking about this for a while but from a slightly different perspective.

It is a bit of a problem not having all the formulas available and, yet, it is a lot of work to implement them. The work I do is essentially covered by the formulas supported by Koala, but I can accept this doesn't suit everyone.

There is a Java library called the Horrible SpreadSheet Format (HSSF) in Apache POI. The HSSF is mature and has been reading and evaluating Excel formulae for a while now. You can have a poke around in the code here: https://github.com/apache/poi/tree/trunk/src/java/org/apache/poi/ss/formula

I've not used Jython but it might be that we can Jython the Apache POI stuff...? Or maybe even wrap it.

Absolute worst case (and I don't advocate this), implement the POI SS formula stuff in Python. They do have a very well organised framework...

It would be absolutely magic if we can find a way to read and evaluate formulae as an extension to Pandas and Numpy. Literally in terms of inheritance style. But I fear that might be a little too much to ask for.

aodhan-domhnaill commented 5 years ago

Poi looks good. My reasoning behind Libreoffice is that their code base is OpenCL, so there is potential to be GPU supported, which seems to fall in line with the goal of Koala to generate high performance compute kernels for Excel macros.

That said, Poi is much cleaner than Libreoffice, so I imagine that would be easier to work with.

Perhaps a thin layer of abstraction is in order so Koala could swap out it's compute backend

EDIT:

Looking more into POI, it seems very mature. Personally as a "customer", I would like to see Koala have flexibility in compute options even if there is a default for Poi. I imagine someday I might want to swap in other solutions.

That said, Poi might be able to integrate with Spark

Brad-eki commented 5 years ago

Support for Spark would be sweet. I also like the idea of GPU support. They address different scales of data analysis and processing.

There are two dimensions on this... "command"/formula coverage and calculation efficiency. Many of the use cases that need broad support for the formulae like financial applications may not need blinding calculation speed. But the more focused scientific and engineering applications are likely to need the calculation efficiency. If those assumptions hold true, it's perfectly acceptable to wrap a library which is slow to execute but has broad command coverage but then be able to swap out the engine to capture efficient processing with a more limited set of commands.

I work in the energy sector and the environment I'm currently in is staunchly Pandas/NumPy. I know, through experience, I won't get any tools introduced here unless it can be used through the veil of Pandas+Numpy. As limiting as that is (eg; Spark-wise, and other awesome tech) I reckon being able to choose the calculation engine behind Koala would be a boon.

To hijack the conversation for a moment...

In recent times I have had contact with a project called xlwings which has an Excel plug-in that manages a COM connection, has an optional Python server, REST API to your Python code, User Defined Functions for Excel, seemingly SQL support and integrates Pandas, Numpy and matplotlib to a degree where you can insert matplotlib charts into Excel and, I believe, transform worksheets as dataframes. For my use-case (an accounting integration) so far it has proven to be as good as having Python as a native Excel language.

To my mind, what xlwings is missing is a formula reader and evaluation mechanism. The company I work for often has need to enable non-programmers to enter or alter calculations within a more complex model. (eg; parts of a model, not necessarily the entire thing).

My understanding of your comment on Koala's goal to generate high performance compute kernels for Excel macros is essentially in-line with what I'm looking for. I would like to see Excel formulae read from Excel and evaluated in Python on a slightly more interactive basis than using a spreadsheet as a template for defining a model for calculation (eg; load a spreadsheet and evaluate). I think I can see potential between Koala and xlwings to cobble together a situation where there's a "build my model" style button somewhere in xlwings and Koala (or something similar) reads Excel, can calculate results and use the freshly established model.

To then be able to choose the calculation engine and scale up to GPU processing and/or something like Spark... That would be a great thing.

EDIT: I've just learned xlwings already has a framework for integrating. There is the ability to write "converters", which is how xlwings integrates with NumPy and Pandas. So writing a converter to use Koala or similar is likely to work.