redgeoff / delta-pouch

Conflict-free collaborative editing for PouchDB

Support diff-patch? #42

Closed robertgartman closed 9 years ago

robertgartman commented 9 years ago

Hi,

Thanks for Delta-Pouch!

We rely heavily on PouchDB and CouchDB, and we are at the point where we need to tune performance. We are considering moving from many small docs to fewer, larger ones. Delta Pouch could be the way to go. It seems, though, that in many cases our data model will not benefit from any "delta" at all. We can end up with docs that have hundreds of fields. The structure is hierarchical, and arrays occur at several levels. As far as our tests show, the current implementation of Delta Pouch copies a complex data structure in full every time, regardless of what changed within it.

I guess it boils down to the cost of detecting changes and the question of how to communicate changes in the data structure. This is what we are looking for, though, and I would like your input on extending Delta Pouch to produce deltas based on something like diff-patch.

There seem to be a few options available:

RFC 6902 http://tools.ietf.org/html/rfc6902

RFC 7386 http://tools.ietf.org/html/rfc7386

Or something like; https://github.com/benjamine/jsondiffpatch

The demo at http://benjamine.github.io/jsondiffpatch/demo/index.html provides a good visualisation of diff-patch. Using the datasets in that demo, we can compare the deltas produced by different algorithms.

Let's start with https://github.com/benjamine/jsondiffpatch, which produces a ~0.7KB delta:

{
    "summary": [
        "@@ -638,17 +638,17 @@\n via, Bra\n-z\n+s\n il,  %0ACh\n@@ -916,20 +916,13 @@\n re a\n-lso known as\n+.k.a.\n  Car\n",
        0,
        2
    ],
    "surface": [
        17840000,
        0,
        0
    ],
    "demographics": {
        "population": [
            385742554,
            385744896
        ]
    },
    "languages": {
        "2": [
            "inglés"
        ],
        "_t": "a",
        "_2": [
            "english",
            0,
            0
        ]
    },
    "countries": {
        "0": {
            "capital": [
                "Buenos Aires",
                "Rawson"
            ]
        },
        "9": [
            {
                "name": "Antártida",
                "unasur": false
            }
        ],
        "10": {
            "population": [
                42888594
            ]
        },
        "_t": "a",
        "_4": [
            "",
            10,
            3
        ],
        "_8": [
            "",
            2,
            3
        ],
        "_10": [
            {
                "name": "Uruguay",
                "capital": "Montevideo",
                "independence": "1825-08-24T22:00:00.000Z",
                "unasur": true
            },
            0,
            0
        ],
        "_11": [
            {
                "name": "Venezuela",
                "capital": "Caracas",
                "independence": "1811-07-04T22:00:00.000Z",
                "unasur": true
            },
            0,
            0
        ]
    },
    "spanishName": [
        "Sudamérica"
    ]
}
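For readers unfamiliar with the notation above, here is a minimal sketch of how the leaf shapes in a jsondiffpatch delta can be read, based on my understanding of the library's documented delta format: `[new]` is an addition, `[old, new]` a modification, `[old, 0, 0]` a deletion, `[text, 0, 2]` a text diff, and `["", index, 3]` an array move. The function names `classifyDeltaEntry` and `listChanges` are made up for illustration and are not part of jsondiffpatch.

```javascript
// Hypothetical helper (not a jsondiffpatch API): classify one delta entry
// by its shape, following the delta format as described in the lead-in.
function classifyDeltaEntry(entry) {
  if (!Array.isArray(entry)) {
    // Plain objects are nested deltas; `_t: "a"` marks a delta over an array.
    return entry && entry._t === 'a' ? 'array-delta' : 'object-delta';
  }
  if (entry.length === 1) return 'added';
  if (entry.length === 2) return 'modified';
  if (entry.length === 3) {
    if (entry[2] === 0) return 'deleted';
    if (entry[2] === 2) return 'text-diff';
    if (entry[2] === 3) return 'moved';
  }
  return 'unknown';
}

// Hypothetical helper: walk a delta and list each leaf change with its path.
// Keys such as "_4" refer to positions in the original array.
function listChanges(delta, path = '') {
  const changes = [];
  for (const [key, entry] of Object.entries(delta)) {
    if (key === '_t') continue; // type marker, not a change
    const kind = classifyDeltaEntry(entry);
    const here = `${path}/${key}`;
    if (kind === 'object-delta' || kind === 'array-delta') {
      changes.push(...listChanges(entry, here));
    } else {
      changes.push({ path: here, kind });
    }
  }
  return changes;
}

// A trimmed excerpt of the delta shown above.
const excerpt = {
  surface: [17840000, 0, 0],
  demographics: { population: [385742554, 385744896] },
  countries: {
    0: { capital: ['Buenos Aires', 'Rawson'] },
    _t: 'a',
    _4: ['', 10, 3],
  },
  spanishName: ['Sudamérica'],
};

const changes = listChanges(excerpt);
console.log(changes);
```

Read this way, the compactness comes from storing only the changed leaves, plus moves and text diffs, instead of whole replaced values.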

Running the same datasets through http://chbrown.github.io/rfc6902/ gives us a 2.4KB delta:

{"op":"remove","path":"/surface"}
{"op":"add","path":"/spanishName","value":"Sudamérica"}
{"op":"replace","path":"/summary","value":"South America (Spanish: América del Sur, Sudamérica or \nSuramérica; Portuguese: América do Sul; Quechua and Aymara: \nUrin Awya Yala; Guarani: Ñembyamérika; Dutch: Zuid-Amerika; \nFrench: Amérique du Sud) is a continent situated in the \nWestern Hemisphere, mostly in the Southern Hemisphere, with \na relatively small portion in the Northern Hemisphere. \nThe continent is also considered a subcontinent of the \nAmericas.[2][3] It is bordered on the west by the Pacific \nOcean and on the north and east by the Atlantic Ocean; \nNorth America and the Caribbean Sea lie to the northwest. \nIt includes twelve countries: Argentina, Bolivia, Brasil, \nChile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, \nUruguay, and Venezuela. The South American nations that \nborder the Caribbean Sea—including Colombia, Venezuela, \nGuyana, Suriname, as well as French Guiana, which is an \noverseas region of France—are a.k.a. Caribbean South \nAmerica. South America has an area of 17,840,000 square \nkilometers (6,890,000 sq mi). Its population as of 2005 \nhas been estimated at more than 371,090,000. South America \nranks fourth in area (after Asia, Africa, and North America) \nand fifth in population (after Asia, Africa, Europe, and \nNorth America). The word America was coined in 1507 by \ncartographers Martin Waldseemüller and Matthias Ringmann, \nafter Amerigo Vespucci, who was the first European to \nsuggest that the lands newly discovered by Europeans were \nnot India, but a New World unknown to Europeans."}
{"op":"replace","path":"/demographics/population","value":385744896}
{"op":"replace","path":"/languages/2","value":"inglés"}
{"op":"replace","path":"/countries/0","value":{"name":"Argentina","capital":"Rawson","independence":"1816-07-08T22:00:00.000Z","unasur":true}}
{"op":"add","path":"/countries/2","value":{"name":"Peru","capital":"Lima","independence":"1821-07-27T22:00:00.000Z","unasur":true}}
{"op":"remove","path":"/countries/4"}
{"op":"remove","path":"/countries/8"}
{"op":"replace","path":"/countries/9","value":{"name":"Antártida","unasur":false}}
{"op":"replace","path":"/countries/10","value":{"name":"Colombia","capital":"Bogotá","independence":"1810-07-19T22:00:00.000Z","unasur":true,"population":42888594}}
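To make the semantics of these operations concrete, here is a minimal sketch of applying `add`, `remove` and `replace` operations to a document. It is not the rfc6902 library's API: the names `parsePointer`, `applyOperation` and `applyPatch` are made up, the `move`, `copy` and `test` operations are omitted, and none of the error checks that RFC 6902 requires (for example that the target of a `remove` exists) are performed. The sample document is a small stand-in, not the demo dataset; the first two `languages` entries are placeholders.

```javascript
// Decode a JSON Pointer (RFC 6901) into its reference tokens.
function parsePointer(pointer) {
  if (pointer === '') return [];
  return pointer
    .slice(1)
    .split('/')
    .map((token) => token.replace(/~1/g, '/').replace(/~0/g, '~'));
}

// Apply a single operation in place. Only add/remove/replace are handled.
function applyOperation(doc, { op, path, value }) {
  const tokens = parsePointer(path);
  if (tokens.length === 0) {
    throw new Error('whole-document operations are not handled in this sketch');
  }
  const key = tokens.pop();
  let parent = doc;
  for (const token of tokens) parent = parent[token];

  if (Array.isArray(parent)) {
    // "-" means "after the last element" and is only meaningful for add.
    const index = key === '-' ? parent.length : Number(key);
    if (op === 'add') parent.splice(index, 0, value);
    else if (op === 'remove') parent.splice(index, 1);
    else if (op === 'replace') parent[index] = value;
    else throw new Error(`unsupported op: ${op}`);
  } else {
    if (op === 'add' || op === 'replace') parent[key] = value;
    else if (op === 'remove') delete parent[key];
    else throw new Error(`unsupported op: ${op}`);
  }
  return doc;
}

// Apply a list of operations to a deep copy, leaving the input untouched.
function applyPatch(doc, operations) {
  const copy = JSON.parse(JSON.stringify(doc));
  return operations.reduce(applyOperation, copy);
}

// Stand-in document; 'lang-a' and 'lang-b' are placeholder values.
const before = {
  surface: 17840000,
  demographics: { population: 385742554 },
  languages: ['lang-a', 'lang-b', 'english'],
};

// Four of the operations listed above.
const after = applyPatch(before, [
  { op: 'remove', path: '/surface' },
  { op: 'add', path: '/spanishName', value: 'Sudamérica' },
  { op: 'replace', path: '/demographics/population', value: 385744896 },
  { op: 'replace', path: '/languages/2', value: 'inglés' },
]);
console.log(after);
```

Note that each `replace` carries the complete new value. That is why the one-word edit to `/summary` above ships the entire text again, and it accounts for most of the size difference from the jsondiffpatch delta.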

I've only found a Java implementation of the newer RFC 7386 (https://github.com/fge/json-patch), so let's leave that out for the moment.
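For reference, the merge algorithm in RFC 7386 is short enough to sketch directly from the pseudocode in the RFC. The following is such a sketch, not an existing library; the function name `mergePatch` and the sample document are made up for illustration, and the first two `languages` entries are placeholders.

```javascript
// JSON Merge Patch, following the MergePatch pseudocode in RFC 7386.
// Returns a new value instead of mutating `target`.
function mergePatch(target, patch) {
  const isObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

  // A non-object patch (scalar, array or null) replaces the target wholesale.
  if (!isObject(patch)) return patch;

  const result = isObject(target) ? { ...target } : {};
  for (const [name, value] of Object.entries(patch)) {
    if (value === null) {
      delete result[name]; // null means "remove this member"
    } else {
      result[name] = mergePatch(result[name], value);
    }
  }
  return result;
}

// Stand-in document; 'lang-a' and 'lang-b' are placeholder values.
const doc = {
  surface: 17840000,
  demographics: { population: 385742554 },
  languages: ['lang-a', 'lang-b', 'english'],
};

const patched = mergePatch(doc, {
  surface: null,
  spanishName: 'Sudamérica',
  demographics: { population: 385744896 },
  // Arrays cannot be patched element by element: the whole array is resent.
  languages: ['lang-a', 'lang-b', 'inglés'],
});
console.log(patched);
```

Two properties matter for our case: a merge patch replaces arrays wholesale, so docs with arrays at several levels gain little from it, and it cannot set a member to `null`, because `null` is reserved for removal.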

There are surely pros and cons to each algorithm. It seems reasonable to assume that jsondiffpatch will generally produce fewer bytes, with the drawback of being less readable for humans. Another consideration is that a diff from jsondiffpatch will play well with neither a db index nor technologies such as CouchDB -> ElasticSearch via a river. There are of course ways around these issues.

Would you be interested in tweaking your code to support different delta algorithms as plugins? Or would you rather recommend implementing this in a completely separate project?
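To illustrate what I mean by a plugin, here is a rough sketch. Everything in it is hypothetical and none of it is Delta Pouch's actual API: a strategy is assumed to be an object with `name`, `diff(before, after)` and `patch(doc, delta)`, and `createDeltaStore` is a made-up in-memory stand-in for wherever the deltas would be persisted. The `shallowStrategy` shown compares top-level fields only, which mimics the behaviour we see in our tests, where a changed nested structure is copied in full.

```javascript
// Hypothetical strategy: top-level comparison only. A changed field is stored
// in full, however small the change inside it.
const shallowStrategy = {
  name: 'shallow',
  diff(before, after) {
    const delta = { set: {}, unset: [] };
    for (const key of Object.keys(after)) {
      // JSON.stringify is a crude equality check (sensitive to key order),
      // but it is enough for a sketch.
      if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
        delta.set[key] = after[key];
      }
    }
    for (const key of Object.keys(before)) {
      if (!(key in after)) delta.unset.push(key);
    }
    return delta;
  },
  patch(doc, delta) {
    const result = { ...doc, ...delta.set };
    for (const key of delta.unset) delete result[key];
    return result;
  },
};

// Hypothetical store: records deltas with whichever strategy is plugged in
// and rebuilds a document by replaying them in order.
function createDeltaStore(strategy) {
  const log = [];
  return {
    log,
    record(before, after) {
      log.push({ strategy: strategy.name, delta: strategy.diff(before, after) });
    },
    replay(base) {
      return log.reduce((doc, entry) => strategy.patch(doc, entry.delta), base);
    },
  };
}

const store = createDeltaStore(shallowStrategy);
const v1 = { title: 'doc', items: [1, 2, 3], obsolete: true };
const v2 = { title: 'doc', items: [1, 2, 4] };
store.record(v1, v2);
console.log(store.log[0].delta); // the whole `items` array is stored
console.log(store.replay(v1));
```

A jsondiffpatch-based or RFC 6902-based strategy would implement the same two functions. Storing the strategy name next to each delta would let docs written with different algorithms coexist.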

Many thanks, Robert

redgeoff commented 9 years ago

This is an interesting idea! Unfortunately, I think it is currently out of scope for delta-pouch. I would recommend starting a new project for this.