n-way merge of odf documents

GoogleCodeExporter commented 9 years ago

Amendments to a bill are approved in individual documents, each submitted by an MP.

Assuming that there are no overlapping merges (a check needs to be added for this), this involves merging 'n' different ODF documents into 1 document, which will be called the consolidated bill document.

Original issue reported on code.google.com by ashok.ha...@gmail.com on 14 Apr 2010 at 8:51

GoogleCodeExporter commented 9 years ago

JNDiff has an n-way diff and merge algorithm; need to check whether it suits the above scenario.

http://jndiff.sourceforge.net

Original comment by ashok.ha...@gmail.com on 14 Apr 2010 at 8:54

GoogleCodeExporter commented 9 years ago

demo for a licensed product -- 

http://opendocument.deltaxml.com/free/demo/odt/merge/four-editors/

Original comment by ashok.ha...@gmail.com on 20 Apr 2010 at 10:49

GoogleCodeExporter commented 9 years ago

Also el4j
http://sourceforge.net/projects/el4j/files/
http://www.javaworld.com/javaworld/jw-07-2007/jw-07-xmlmerge.html

Also using StaX
http://stax.codehaus.org/Home
http://www.devx.com/ibm/Article/20269

Original comment by ashok.ha...@gmail.com on 20 Apr 2010 at 11:04

GoogleCodeExporter commented 9 years ago

The merge process has 2 parts:
 1) safeguard - check for overlapping changes; if any are found, do not merge and present an error to the user
 2) n-way merge of the ODF documents to produce the final merged ODF
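
A minimal sketch of the safeguard step, not the actual implementation. It assumes each change is identified by a node address written as a tuple of child indexes from the document root, and it assumes "overlapping" means two different documents change the same node or a node inside the other's subtree. The names `is_prefix`, `find_overlaps` and `safe_to_merge` are hypothetical.

```python
from itertools import combinations

# Assumption: a node address is a tuple of child indexes from the root,
# e.g. (0, 2, 1) = root's 1st child -> its 3rd child -> its 2nd child.


def is_prefix(a, b):
    """True if address a is the same node as b or an ancestor of b."""
    return len(a) <= len(b) and b[:len(a)] == a


def find_overlaps(changes_by_doc):
    """Return pairs of changes from different documents that touch the same subtree.

    changes_by_doc maps a document name to the list of node addresses
    changed in that document.
    """
    overlaps = []
    docs = sorted(changes_by_doc.items())
    for (doc_a, addrs_a), (doc_b, addrs_b) in combinations(docs, 2):
        for a in addrs_a:
            for b in addrs_b:
                if is_prefix(a, b) or is_prefix(b, a):
                    overlaps.append(((doc_a, a), (doc_b, b)))
    return overlaps


def safe_to_merge(changes_by_doc):
    """Part 1 of the process: only merge when no overlaps are found."""
    return not find_overlaps(changes_by_doc)
```

If `find_overlaps` returns anything, the merge would be skipped and the offending pairs reported to the user.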

Original comment by ashok.ha...@gmail.com on 20 Apr 2010 at 11:55

GoogleCodeExporter commented 9 years ago

comparison of currently available xml diff mechanisms
http://www.scribd.com/doc/14482474/XML-diff-survey

Original comment by ashok.ha...@gmail.com on 20 Apr 2010 at 9:28

GoogleCodeExporter commented 9 years ago

TO DO :

test the google diff-match-patch library with odf track changes

http://code.google.com/p/google-diff-match-patch/

Original comment by ashok.ha...@gmail.com on 21 Apr 2010 at 9:49

GoogleCodeExporter commented 9 years ago

Also test diffxml :

http://sourceforge.net/projects/diffxml/files/diffxml/

Original comment by ashok.ha...@gmail.com on 21 Apr 2010 at 9:49

GoogleCodeExporter commented 9 years ago

Original comment by ashok.ha...@gmail.com on 4 May 2010 at 11:29

Changed state: Started

GoogleCodeExporter commented 9 years ago

We use a hybrid mechanism of XML parsing and recording XML fragments in a db to merge n XML documents into 1 document.

Using the db reduces the memory required for in-memory processing of XML.

The basic logic of the merge works as follows --

Primary assumptions:

-- The ODF header and the ODF content body are merged independently. This is because a track change mark in ODF adds header entries to the <text:tracked-changes> container in the ODF content header. Merging the header and body as one unit would therefore have required a node-level merge and synchronization between the change entries and the header. Treating them independently makes the merge much simpler.

-- There are no overlapping merges. Overlapping merges are identified through an exception case handled by the merge process (i.e. the merge failed...).

merge process -

 - we iterate through the 'n' changed documents. Change info is extracted and recorded in a db: the node address of each change is recorded, along with the order of the change (1, 2, 3 ...)
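
A rough sketch of what recording change info could look like, using SQLite from the Python standard library. The table layout, column names and helper functions are all assumptions made for illustration; the thread does not say which db or schema is used.

```python
import sqlite3


def open_change_store(path=":memory:"):
    """Open (or create) a hypothetical store for extracted changes."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS changes (
               doc_name     TEXT NOT NULL,     -- which of the 'n' documents
               change_order INTEGER NOT NULL,  -- 1, 2, 3 ... within that document
               node_address TEXT NOT NULL,     -- address of the changed node
               fragment     TEXT NOT NULL,     -- the extracted XML fragment
               PRIMARY KEY (doc_name, change_order)
           )"""
    )
    return conn


def record_change(conn, doc_name, change_order, node_address, fragment):
    """Store one change so it does not have to be kept in memory."""
    conn.execute(
        "INSERT INTO changes VALUES (?, ?, ?, ?)",
        (doc_name, change_order, node_address, fragment),
    )


def changes_in_order(conn):
    """Read changes back, lowest order number first."""
    return conn.execute(
        "SELECT doc_name, change_order, node_address FROM changes "
        "ORDER BY change_order, doc_name"
    ).fetchall()
```

Keeping the fragments in the db means only the addresses and order numbers need to be read back while the merge decides what to stream next.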

for the content body -- 
 - change nodes are processed for the 'n' documents, starting with the lowest order number
    -- the node addresses of the change nodes are compared to identify the shallowest one, i.e. the one that comes first with respect to the original document [1]
    -- the shallowest node [1] is handled first, and all the preceding:: nodes of the shallowest node are captured and streamed into an XML document [2]
    -- the shallowest node itself is then streamed into the incremental XML document [2]
    -- the next shallowest node [3] of the 'n' documents is handled next and the same process is repeated, except that only the preceding:: nodes up to the end of the [1] node are streamed into the XML document.
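
The streaming steps above can be sketched in a much simplified form. This is an illustration under stated assumptions, not the real merge: the original body is flattened to a list of nodes, only inserts are handled, and each insert is a hypothetical `(position, fragment)` pair meaning "the fragment goes before the original node at that index".

```python
def merge_inserts(original_nodes, inserts):
    """Merge inserts from several documents into one output stream.

    original_nodes: nodes of the original document, in document order.
    inserts: (position, fragment) pairs collected from the 'n' documents.
    """
    merged = []
    cursor = 0  # how much of the original has been streamed so far
    # Handle the change that comes first in the original document first.
    for position, fragment in sorted(inserts, key=lambda item: item[0]):
        # Stream the original nodes that precede this change, but only
        # those not already streamed for an earlier change.
        merged.extend(original_nodes[cursor:position])
        # Then stream the change node itself.
        merged.append(fragment)
        cursor = position
    # Stream whatever remains of the original after the last change.
    merged.extend(original_nodes[cursor:])
    return merged
```

For example, merging an insert before the second node and another before the fourth node yields the original nodes with both fragments in place, regardless of which document each insert came from.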

for the content header -- 
 - the content header is a simpler header-detail XML merge scenario. A ready-made tool like diffxml will be used for this.

Original comment by ashok.ha...@gmail.com on 6 May 2010 at 8:47

GoogleCodeExporter commented 9 years ago

To compare node order :

http://code.google.com/p/doctype/wiki/ArticleNodeCompareDocumentOrder

Original comment by ashok.ha...@gmail.com on 6 May 2010 at 1:40

GoogleCodeExporter commented 9 years ago

Node order is compared using compareDocumentPosition()
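
Where a DOM implementation does not offer compareDocumentPosition(), the same ordering can be derived from child-index paths. A small sketch using Python's xml.dom.minidom; `node_path` and `precedes` are hypothetical helper names, and this is a substitute technique rather than the method named above.

```python
from xml.dom import minidom


def node_path(node):
    """Child-index path from the document root down to node."""
    path = []
    while node.parentNode is not None:
        siblings = node.parentNode.childNodes
        # Locate the node among its siblings by identity.
        path.append(next(i for i, s in enumerate(siblings) if s is node))
        node = node.parentNode
    return tuple(reversed(path))


def precedes(a, b):
    """True if node a comes before node b in document order.

    Tuple comparison on the paths matches document order: an ancestor
    (shorter prefix) sorts before its descendants.
    """
    return node_path(a) < node_path(b)
```

The paths also double as node addresses, so the same tuples could be stored and sorted without keeping the DOM nodes around.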

Original comment by ashok.ha...@gmail.com on 6 May 2010 at 1:53

GoogleCodeExporter commented 9 years ago

<text:change-start> <text:change-end> can encompass whole sections and tables.

We need to collapse an insert change: extract it temporarily, replace it with a marker, process the merge, and then replace the marker with the extracted text (XML).
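
A sketch of the collapse and restore idea using xml.etree.ElementTree. It assumes the change-start and change-end elements are siblings under the same parent, which will not hold for every ODF document, and the `merge-marker` element name and both function names are made up for the example.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
START = f"{{{TEXT_NS}}}change-start"
END = f"{{{TEXT_NS}}}change-end"
CHANGE_ID = f"{{{TEXT_NS}}}change-id"
MARKER = "merge-marker"  # hypothetical placeholder element, not part of ODF


def collapse_inserts(parent):
    """Replace each change-start..change-end sibling run with one marker.

    Returns {change_id: [extracted elements]} for later restoration.
    """
    extracted = {}
    children = list(parent)
    i = 0
    while i < len(children):
        if children[i].tag != START:
            i += 1
            continue
        change_id = children[i].get(CHANGE_ID)
        j = i + 1
        while j < len(children) and not (
            children[j].tag == END and children[j].get(CHANGE_ID) == change_id
        ):
            j += 1
        if j == len(children):
            raise ValueError(f"no change-end found for {change_id}")
        run = children[i:j + 1]
        position = list(parent).index(run[0])
        for node in run:
            parent.remove(node)
        parent.insert(position, ET.Element(MARKER, {"ref": change_id}))
        extracted[change_id] = run
        i = j + 1
    return extracted


def restore_inserts(parent, extracted):
    """Put the extracted runs back where their markers are."""
    for marker in list(parent.findall(MARKER)):
        position = list(parent).index(marker)
        parent.remove(marker)
        for offset, node in enumerate(extracted[marker.get("ref")]):
            parent.insert(position + offset, node)
```

While collapsed, a whole inserted section or table is a single small node, so the merge only has to place the marker correctly.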

Original comment by ashok.ha...@gmail.com on 6 May 2010 at 3:21

GoogleCodeExporter commented 9 years ago

A more efficient approach is to group changes by the parent node containing the change.
Since the parent nodes always exist in the parent document, the parent node groupings can be ordered by comparing document position in the original document. Node change processing can then be localized to within the parent node groupings.
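
A short sketch of the grouping, assuming node addresses are tuples of child indexes from the root so that the parent address is simply the address without its last index. `group_by_parent` is a hypothetical name.

```python
from collections import defaultdict


def group_by_parent(change_addresses):
    """Group change addresses by parent and order the groups.

    Returns a list of (parent_address, [change addresses]) pairs, with
    parents in document order and changes ordered within each parent.
    """
    groups = defaultdict(list)
    for address in change_addresses:
        groups[address[:-1]].append(address)
    # Sorting index tuples gives document order in the original document.
    return [(parent, sorted(groups[parent])) for parent in sorted(groups)]
```

Each group can then be merged on its own, which keeps the comparisons local to one parent node.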

Original comment by listmans...@gmail.com on 9 May 2010 at 7:00

GoogleCodeExporter commented 9 years ago

Tested for 2-way insert merge.

Inserts add sections to different parts of the document.

The preceding and following incremental document changes are captured in the db and on the file system.

To Do :
-------

 - Build merged document from extracted parts 
 - Test more complex insert scenarios
 - Add logic for delete scenarios
 - Check for overlaps

Original comment by ashok.ha...@gmail.com on 12 May 2010 at 4:01

GoogleCodeExporter commented 9 years ago

Setting milestone current issues

Original comment by ashok.ha...@gmail.com on 14 May 2010 at 11:20

Added labels: Milestone-ImmediateTerm

mariarahat / bungeni-editor

n-way merge of odf documents #73