Development ideas - Githubissues

jmacadie commented 8 years ago

Hi,

I work building (big) Excel financial models and source control is a nightmare. I love git (for small personal coding projects) but it doesn't work with Excel, as we all know. I came across your project yesterday after following a link from http://programmaticallyspeaking.com/git-diffing-excel-files.html and it's a bit of a revelation for me.

I've been giving some thought to how I'd approach making your tool more functional for the types of spreadsheet I work with and below are my high level thoughts on how the tool could be structured. The key idea being to track transformations first and then cell diff (hopefully) aligned cells. I know you have this on your to do list and to my way of thinking tracking cell transformations requires quite a different program structure, which as well as being difficult is possibly why you've not got round to it yet.

I know nothing about java and am not a trained programmer but am happy to try to help out (day-job workload permitting) if you're keen. Alternately if the changes are too radical/not where you want to go I could just fork you and go my own way. Let me know.

Anyway here's my thoughts:

Maintain 3 spreadsheets

From spreadsheet (immutable)
To spreadsheet (immutable)
Working spreadsheet, a clone of the From spreadsheet to start with but is gradually edited by applying the changes identified in to two transform sections below

FIND TRANSFORMATIONS idea is to repress high-volume cell difference reporting that can be easily explained by the single transformation that created the bulk changes

SHEET TRANSFORMATIONS

(REPORT) Sheets re-ordered
(REPORT) Sheet renamed
(REPORT) Sheet copied
(REPORT) Sheet deleted

CELL TRANSFORMATIONS

(REPORT) Rows added
(REPORT) Rows deleted
(REPORT) Rows cut & pasted (i.e. moved)
(REPORT) Rows copied & inserted (i.e. duplicated)
(REPORT) Columns added
(REPORT) Columns deleted
(REPORT) Columns cut & pasted (i.e. moved)
(REPORT) Columns copied & inserted (i.e. duplicated)
(REPORT) Cells deleted (shift left)
(REPORT) Cells deleted (shift up)
(REPORT) Cells inserted (shift left)
(REPORT) Cells inserted (shift down)
Possibly need to run this loop 2 (or more) times per sheet to catch instances where, for example, columns have been inserted and then rows

FIND CELL DIFFERENCES Can now compare the cells in the Working spreadsheet to the To spreadsheet

Need a method to translate the cell address through the transformations to infer the original cell reference in the From spreadsheet (i.e. reverse apply the transformations) for reporting

FORMULAE CONDENSE

Condense the sheet down to just unique formulae on the sheet and the blocks of cells that all share the same formula
This is important as in big Excel files it is common practice to fill formulae. For example a spreadsheet might have 500 columns and each row will have a filled-across formula. If the diff reports on each of the 500 differences (per edited row) then you'll spam the output and not be able to easily determine what's changed
In VBA you'd do this by comparing the FormulaR1C1 property of the cells for equivalence
Output should be a paired list of (block) cell references and the formula in the top, left cell of the block

CELL DIFFERENCES

(REPORT) Where cell blocks with matching formulae have been extended or reduced
(REPORT) Where matching cell blocks have different formulae
(REPORT) Where non-matching cell blocks with different formulae (need to think about this one a bit)

jmacadie commented 8 years ago

Oh and the other thing I meant to mention about this program structure and transformations is that you don't have to apply any of the transformations to start with. The reported cell differences stage at the end is a catch-all and will work as your current program does to report any instance of the same cell location having a different formula. As you go along you can fill in each element of transformation finding logic and it will act to suppress the reported cell differences in the final stage.

na-ka-na commented 8 years ago

Thanks for writing James.

Yes transformations has been on the TODO list for a long time. Back when I was developing the tool for my use case, the way I got around it was first manually transforming the sheet and then feeding it into ExcelCompare.

I didn't do it then because, as you point out - they are not easy. Now that I don't use this tool actively - I only spend time maintaining it and small feature requests. So contributions are very welcome. Trust me Java is a simple language :), especially the subset of it I use in this project.

We can do transformations 2 ways - 1) automatic detection, or, 2) manual input. I think automatic detection, though sexy, may be a can of worms which might consume lots of programming time, not to mention compute resources. Even then it might come back with underwhelming results, or even "wrong" results which the user doesn't expect. Manual input of transformations will work, though it requires coming up with a way of expressing them. I would start with manual input and attempt automatic only as an orthogonal project.

Say if we decide to go with manual input - we take in the user input for transformation (for e.g. if a column and row was added - they can say shift column K right and row 10 down). Do all the transformations in memory after reading the Excel file and then do the diff as today.

Regarding formulae, currently I do only diff formulae for cells with formulae. But someone wanted a feature request to ignore formulae and diff values as they constantly changed formulae but wanted to see what values where changing. So there are --diff_numeric_precision, --diff_ignore_formulas flags.

Thanks.

jmacadie commented 8 years ago

I've been doing a bit of work on my ideas. Being a total newb to java I downloaded netbeans for an IDE and it's apparent you weren't using netbeans for the current application. I'm sure porting is pretty simple for anyone with much experience but I was struggling to get it to build with external libraries etc. I also found I was struggling to get my head round what all your code was doing. Finally my ideas for development are structurally quite different to your current project.

For all of the above reasons I found it easier to start from scratch with my own project. After (not much) thought I've come to appreciate just how hard detecting transformations is going to be. I've therefore been focusing on the formulae condense idea in my original thought dump, and had some success. The idea being to condense a given sheet down into a paired list of unique formulae and all the cells that share that same formula. This will be a real boon for reporting on effective differences within spreadsheets where fill down / right has been heavily used (as is the case on most of the files I work with).

The biggest issue I've had to deal with so far is that POI doesn't do R1C1 notation (https://poi.apache.org/apidocs/org/apache/poi/ss/formula/OperationEvaluationContext.html & search page for R1C1) and that's pretty much a pre-requisite for grouping like formulae, so I've had to roll my own.

Anyways feel free to take a look: https://github.com/jmacadie/ExcelCompare2 and ping me any thoughts. The project is pretty nascent as yet and doesn't get as far as actually doing a diff. Also any pointers on how to improve my super basic java writing style gratefully accepted.

James

jmacadie commented 8 years ago

Done a bit more work and got my first diff out! LOTS of work to solve just one problem of grouping like cells together.

I think I'm probably going to keep my new project separate from yours as the underlying program structure is quite different and I think it would hard to integrate without simply replacing lots of your code base.

As ever, any comments on what I'm doing gratefully received.

na-ka-na commented 8 years ago

Hey I went through your code at a high level and I could understand and like your approach :-). Seriously impressed with your regex foo in convertToR1C1. I guess at that point I might have jumped to writing a full fledged grammar to parse it. Overall code wise, it shows that you are new to Java, welcome! For e.g. to keep a running sorted set of formulae, you could look into java.util.TreeSet.

I am curious to run your code. Do you mind sharing a sample excel that you work with regularly? I wanted to compare ExcelCompare and ExcelCompare2 output. Thanks for pointing me to your code!

On Mon, Aug 22, 2016 at 7:59 AM, James MacAdie notifications@github.com wrote:

Done a bit more work and got my first diff out! LOTS of work to solve just one problem of grouping like cells together.

I think I'm probably going to keep my new project separate from yours as the underlying program structure is quite different and I think it would hard to integrate without simply replacing lots of your code base.

As ever, any comments on what I'm doing gratefully received.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/na-ka-na/ExcelCompare/issues/38#issuecomment-241440996, or mute the thread https://github.com/notifications/unsubscribe-auth/AAQBhK4UFSlCx-FitoEDxGkWpbQVpOzrks5qibligaJpZM4Ji6pG .

jmacadie commented 8 years ago

Thanks for the feedback, will definitely take a look at TreeSet.

The gibhub repo was a bit behind my local as I'm having massive issues pushing from my work laptop (I'm behind a corporate Windows firewall that blocks anything that moves; and quite a few dead things for good measure). I can't recall if I'd pushed a really simple set of test workbooks but anyway I have now so that the main routine of the program will automatically reference the attached resource workbooks in the project.

I'm not sure about comparing the output just yet as my project really doesn't make any attempt to output the text in a helpful format: it's more of a proof of concept at this stage. Not least I still haven't been bothered to loop through all sheets in a workbook to actually do a full comp. I think the full plan is to have some some of diff object and then pass that to a routine that will report a diff object or maybe have reporting as a method of the diff object. However I have tested the code (watching variable when code is broken) on a bunch of workbooks and it seems to be working reasonably well.

The reason I've not attached a bigger workbook, and the sort of thing I'd actually use, is that I feel we need to be detecting transformations; principally rows / columns inserted / deleted. Without this you're going to get literally thousands of false positive differences. The sorts of workbooks I use tend to have thousands of rows and hundreds of columns. The grouping I've put in is pretty essential for me as to manage a file like that you have to stick to row consistency (i.e. same formula filled across EVERY row) but a row inserted at the top of the sheet will still trigger a needless difference report on every subsequent row, which as noted can run to thousands.

I know you're against this but I'd like to have a go and see what I can do on transformations. From my end if I don't get this working then this is never going to be that useful to me whereas if we can figure out transformation detection we've got something very useful on our hands. My, very rough, plan is to design a row / column similarity measure and then say that anything that is more than [95%] similar is really the same row / column and backwards construct the transformations that must have happened. You can adjust that percentage to find more transformations at the cost of maybe finding spurious matches. I need to have a bit more of a think about what an insert and delete does to the structure of formulae on a sheet.

In the last week I've mainly been back-filling unit tests, which uncovered a number of bugs in the process, so it's been useful. Just got CondensedFormulae to go but it's such a high-level object it's getting a bit painful to construct all the fake universes I want to test.

jmacadie commented 8 years ago

A bit more work and I've got my first row insert detection working! It doesn't work for deletes yet though :(, a matter of time...

It works by building up a map of which rows & cols map to one another in the before and after workbooks and then (not yet written) subsequently using this map to work out where the inserts and deletes are.

If you're interested, the current repo is set up to show a row insert detection (via the map object). You'll need to debug the program and inspect the map object though as I don't do anything with it yet.

jmacadie commented 7 years ago

Just a quick update. I've got the whole thing working now so it does translation detection, lines the before and after sheets up (i.e. accounts for the translations) and then figures out what's really changed.

It works on my very simple default test case but I've yet to try the program out on real life examples so I'm sure there are bugs to be found (as well as limitations like whole column / row formulae that I'm already aware of). I've also got to loop back and fill in all the unit tests.

However, if you were interested, now's a good time to take a look.

na-ka-na / ExcelCompare

Development ideas #38