marcello3d / node-mongolian

[project inactive] Mongolian DeadBeef is an awesome Mongo DB driver for node.js
https://groups.google.com/group/node-mongolian
zlib License
349 stars 50 forks source link

Query Tree based data w/o sync? #78

Closed jdarling closed 12 years ago

jdarling commented 12 years ago

This is more a question than an issue, but if I have a collection of data that represents relationships between documents and I want to query it to create a tree of the nodes, what is the best approach? My data DOES have recursive loops (so you have to have a history on the branch so you don't go on forever), and contains bi-directional links at times (again history array).

Currently I query the entire collection and them perform a toArray then re-build the tree by hand. This won't work once the load/data gets heavier, so I'm curious how do you perform this same task in a "asynchronous" environment?

Simple sample (saying a, b, c, d, e are valid id's in another collection): {source: 'a', dest: 'b'} {source: 'a', dest: 'c'} {source: 'b', dest: 'c'} {source: 'd', dest: 'c'} {source: 'e', dest: 'e'} ect...

Would be great if Mongolian had a built in way to store/query relationships as currently MongoDB makes suggestions (http://www.mongodb.org/display/DOCS/Trees+in+MongoDB) on how to store the data, but takes for granted your running synchronous (sorry if I got those backwards) and can thus perform recursive query operations.

Here is my hackjob of a test case:

    var testIt = replServer.context.testIt = function(collection, options, callback){
      collection.find().toArray(function(err, records){
        var 
          tree, root = {}, id, links = {}, recordId = options.recordId,
          enrich = typeof(options.enrichWith)=='function'?options.enrichWith:false,
          ensure = function(eid){
            if(!links[eid]) links[eid] = {id: eid, inbound: [], outbound: []};
            return links[eid];
          },
          checkAddLink = function(sid, did){
            var src = ensure(sid.toString()),
                dst = ensure(did.toString());
            if(src.outbound.indexOf(dst)==-1) src.outbound.push(dst);
            if(dst.inbound.indexOf(src)==-1) dst.inbound.push(src);
          },
          walk = function(root, direction, records, exists, addTo){
            var vObject, connection, dest, i, bDoAdd=true;
            exists = exists||{};
            if(exists[root.id]) return false;
            exists[root.id] = true;

            if(addTo){
              vObject = addTo;
              vObject[direction] = new Array();
            }else{
              vObject = {};
              vObject.id = root.id;
              if(enrich) vObject = enrich(vObject)||vObject;
              vObject[direction] = new Array();
            }

            if(root[direction]) for(i in root[direction]){
              if(dest = walk(root[direction][i], direction, records, exists)){
                vObject[direction].push(dest);
                vObject.recursive = true;
              }
            }
            exists[root.id] = false;
            return vObject;
          };
        for(id in records){
          checkAddLink(records[id].source, records[id].destination);
        }
        tree = walk(links[recordId], 'outbound', links);
        walk(links[recordId], 'inbound', links, false, tree);
        if(typeof(callback)=='function') callback(tree);
      });
    }
marcello3d commented 12 years ago

This sounds more like a node.js problem than a mongolian db or even mongo problem.

I'm guessing you want something like this (disclaimer, completely untested):

function getTree(collection, parentId, callback, cache) {
    // Keep track of previously mapped nodes
    cache = cache || {}
    if (cache[parentId]) return callback(null, cache[parentId])
    // Find children
    collection.find({parent:parentId}).sort({id:1}).toArray(function(error, array) {
        if (error) return callback(error)
        // Keep track of the number of items we got back
        var count = 0
        var tree = cache[parentId] = {}
        array.forEach(function(node) {
            // Get the subtree (in parallel) for each child node
            getTree(collection, node.id, function(error, childTree) {
                // Insert child
                tree[node.id] = childTree
                count++
                // Last child? return the final tree
                if (count == array.length) callback(null,tree)
            }, cache)
        })
    })
}
jdarling commented 12 years ago

Your probably correct in the "Node.js problem" but it shows its ugly head most in this scenario with Mongolian :(

I'm concerned with the above snippet simply because of the (in parallel) statement. If its truly parallel execution then checking count == array.length to calculate when to call the callback could result in the callback being called before all of the children have been processed. This is where serialization of the data fetch, or some type of "All Done" callback, becomes a necessity.

marcello3d commented 12 years ago

The nature of the task is rather inefficient (since you will be doing O(n) queries for every graph retrieval—you could certainly improve on this with some of the tricks mentioned on the mongodb site).

As for your parallel concern, you'll notice that the count++ is in the callback of the recursive getTree call. Just like a regular recursive function, the callback is called recursively and the parent level getTree callback isn't called until all the child getTree callbacks are called.

jdarling commented 12 years ago

I would completely agree with you on the inefficient part, but due to the nature of the data I don't see an easy way around it shy of creating another collection and building/managing the trees within that collection. As these trees are quite complex and have many 1000's of nodes it becomes more efficient (storage/management wise) to build them on the fly.

Left Right optimization doesn't work as another aspect is the trees branch into one another and are recursive (I didn't create the data I just use/display it). I have a working code sample that I'll post here once I clean it up a bit and close the comment. It might help someone else in the future with a similar issue.

marcello3d commented 12 years ago

As you say, it'll depend on the format of the data and how fast you need the lookups to be, but that's certainly out of the scope of Mongolian to solve!

Did the sample above help? There are async/parallel libraries that simplify this type of task, but I wanted to leave out dependencies.