Object Graph Mapper for managing RDF data in Mongo
MIT License
tripod-php

Object Graph Mapper for managing RDF data stored in MongoDB. See also tripod-node.

Features

Quickstart

require_once("tripod.inc.php");

// Queue worker must register these event listeners
Resque_Event::listen('beforePerform', [\Tripod\Mongo\Jobs\JobBase::class, 'beforePerform']);
Resque_Event::listen('onFailure', [\Tripod\Mongo\Jobs\JobBase::class, 'onFailure']);

\Tripod\Config::setConfig($conf); // set the config, usually read in as JSON from a file

$tripod = new \Tripod\Mongo\Driver(
  "CBD_users", // pod (read: MongoDB collection) we're working with
  "myapp" // store (read: MongoDB database) we're working with
);

// describe
$graph = $tripod->describe("http://example.com/user/1");
echo $graph->get_first_literal("http://example.com/user/1","http://xmlns.com/foaf/0.1/name"); 

// select
$data = $tripod->select(
  array("rdf:type.u"=>"http://xmlns.com/foaf/0.1/Person"),
  array("foaf:name"=>true)
);
if ($data['head']['count']>0) {
  foreach($data['results'] as $result) {
    echo $result['foaf:name'];
  }
}

// an expensive pre-defined graph traversal query
$graph = $tripod->getViewForResource("http://example.com/users","v_users");
$allUsers = $graph->get_subjects_of_type("http://xmlns.com/foaf/0.1/Person");

// save
$newGraph = new \Tripod\ExtendedGraph();
$newGraph->add_literal_value("http://example.com/user/2","http://xmlns.com/foaf/0.1/name","John Smith");
$tripod->saveChanges(
  new \Tripod\ExtendedGraph(), // the before state, here there was no before (new data)
  $newGraph // the desired after state
);

// save, but background all the expensive view/table/search generation
$tripod = new \Tripod\Mongo\Driver("CBD_users", "usersdb", array(
    // async opt says what to do later via a queue rather than as part of the save
    'async' => array(OP_VIEWS=>true, OP_TABLES=>true, OP_SEARCH=>true)
  )
);
$tripod->saveChanges(
  new \Tripod\ExtendedGraph(), // the before state, here there was no before (new data)
  $newGraph // the desired after state
);

Requirements

PHP >= 5.5

MongoDB 3.2.x or later

The MongoDB PHP driver (installation: http://mongodb.github.io/mongo-php-driver/#installation)

What does the config look like?

Read the full docs

Before you can do anything with Tripod, you need to initialise the config via the \Tripod\Config::setConfig() method. It takes an associative array, which is typically decoded from a JSON string. Here's an example:

{
    "namespaces" : {
        "rdf":"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "foaf":"http://xmlns.com/foaf/0.1/",
        "exampleapp":"http://example.com/properties/"
    },
    "defaultContext":"http://talisaspire.com/",
    "data_sources" : {
        "cluster1": {
            "type": "mongo",
            "connection": "mongodb:\/\/localhost",
            "replicaSet": ""
        },
        "cluster2": {
            "type": "mongo",
            "connection": "mongodb:\/\/othermongo.example.com",
            "replicaSet": ""
        }
    },
    "stores" : {
        "myapp" : {
            "data_source" : "cluster1",
            "pods" : {
                "CBD_users" : {
                    "cardinality" : {
                        "foaf:name" : 1
                    },
                    "indexes" : {
                        "names": {
                            "foaf:name.l":1
                        }
                    }
                }
            },
            "view_specifications" : [
                {
                    "_id": "v_users",
                    "from":"CBD_users",
                    "type": "exampleapp:AllUsers",
                    "include": ["rdf:type"],
                    "joins": {
                        "exampleapp:hasUser": {
                            "include": ["foaf:name","rdf:type"],
                            "joins": {
                                "foaf:knows" : {
                                    "include": ["foaf:name","rdf:type"]
                                }
                            }
                        }
                    }
                }
            ],
            "table_specifications" : [
                {
                    "_id": "t_users",
                    "type":"foaf:Person",
                    "from":"CBD_users",
                    "to_data_source" : "cluster2",
                    "ensureIndexes":[
                        {
                            "value.name": 1
                        }
                    ],
                    "fields": [
                        {
                            "fieldName": "type",
                            "predicates": ["rdf:type"]
                        },
                        {
                            "fieldName": "name",
                            "predicates": ["foaf:name"]
                        },
                        {
                            "fieldName": "knows",
                            "predicates": ["foaf:knows"]
                        }
                    ],
                    "joins" : {
                        "foaf:knows" : {
                            "fields": [
                                {
                                    "fieldName":"knows_name",
                                    "predicates":["foaf:name"]
                                }
                            ]
                        }
                    }
                }
            ],
            "search_config":{
                "search_provider":"MongoSearchProvider",
                "search_specifications":[
                    {
                        "_id":"i_users",
                        "type":["foaf:Person"],
                        "from":"CBD_users",
                        "to_data_source" : "cluster2",
                        "filter":[
                            {
                                "condition":{
                                    "foaf:name.l":{
                                        "$exists":true
                                    }
                                }
                            }
                        ],
                        "indices":[
                            {
                                "fieldName": "name",
                                "predicates": ["foaf:name", "foaf:firstName","foaf:surname"]
                            }
                        ],
                        "fields":[
                            {
                                "fieldName":"result.name",
                                "predicates":["foaf:name"],
                                "limit" : 1
                            }
                        ]
                    }
                ]
            }
        }
    },
    "transaction_log" : {
        "database" : "testing",
        "collection" : "transaction_log",
        "data_source" : "cluster2"
    }
}
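
A minimal sketch of loading a config like the one above from a JSON file and handing it to Tripod. The helper name loadTripodConfig and the file path are hypothetical; only \Tripod\Config::setConfig() comes from the Quickstart. The class_exists guard is there only so the snippet also runs where Tripod is not installed.

```php
<?php
// Hypothetical helper (not part of Tripod): read a JSON file and decode it
// into the associative array that Config::setConfig() expects.
function loadTripodConfig($path)
{
    $json = @file_get_contents($path);
    if ($json === false) {
        throw new RuntimeException("Cannot read config file: $path");
    }

    // true => decode JSON objects as associative arrays, not stdClass objects
    $conf = json_decode($json, true);
    if (!is_array($conf)) {
        throw new InvalidArgumentException(
            "Invalid JSON in $path: " . json_last_error_msg()
        );
    }
    return $conf;
}

// Usage, assuming the JSON above is saved at this (hypothetical) path.
$path = __DIR__ . '/tripod-config.json';
if (is_file($path) && class_exists('Tripod\Config')) {
    \Tripod\Config::setConfig(loadTripodConfig($path));
}
```

Failing loudly on unreadable or malformed JSON is a deliberate choice here: json_decode() returns null on a syntax error such as a missing comma, and passing that on would only surface later as a harder-to-trace failure.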

Internal data model

Data is stored in Mongo collections, one CBD (Concise Bounded Description) per document. Typically you put all the data for a given object type in its own collection prefixed with CBD_, e.g. CBD_users, although this is a convention rather than a requirement.

Your application can both read from and write to these CBD collections, and writes to them are recorded as transactions in the tlog (see Transactions below).

A CBD might look like this:

{
    "_id" : {
        "r" : "http://example.com/user/2",
        "c" : "http://example.com/defaultContext"
    },
    "siocAccess:Role" : {
        "l" : "an undergraduate"
    },
    "siocAccess:has_status" : {
        "l" : "public"
    },
    "spec:email" : {
        "l" : "me@example.com"
    },
    "rdf:type" : [
        {
            "u" : "foaf:Person"
        },
        {
            "u" : "sioc:User"
        }
    ],
    "foaf:name" : {
        "l" : "John Smith"
    }
}
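
In the document above, a predicate holds either a single value object (foaf:name) or a list of them (rdf:type), and each value object is keyed by l or u. The sketch below reads values out of such a document once it has been fetched as a PHP array. The helper cbdValues is hypothetical and not part of Tripod, and treating l as "literal" and u as "URI" is an inference from the values shown here and in the Quickstart query, not a documented contract.

```php
<?php
// Hypothetical helper (not part of Tripod): return every value stored under
// $predicate in a CBD document, whether it holds one value object or a list.
// $kind is 'l' or 'u', the two keys used in the example document.
function cbdValues(array $doc, $predicate, $kind)
{
    if (!isset($doc[$predicate])) {
        return array();
    }
    $entry = $doc[$predicate];

    // A single value object looks like array('l' => '...'); wrap it so the
    // loop below can treat both shapes the same way.
    if (isset($entry['l']) || isset($entry['u'])) {
        $entry = array($entry);
    }

    $values = array();
    foreach ($entry as $valueObject) {
        if (isset($valueObject[$kind])) {
            $values[] = $valueObject[$kind];
        }
    }
    return $values;
}

// A trimmed copy of the document above, as a PHP array.
$doc = array(
    '_id' => array(
        'r' => 'http://example.com/user/2',
        'c' => 'http://example.com/defaultContext',
    ),
    'rdf:type' => array(array('u' => 'foaf:Person'), array('u' => 'sioc:User')),
    'foaf:name' => array('l' => 'John Smith'),
);

print_r(cbdValues($doc, 'rdf:type', 'u'));  // both type values
print_r(cbdValues($doc, 'foaf:name', 'l')); // the single name literal
```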

A brief guide:

Transactions

MongoDB is only atomic at the document level. Tripod datasets store one CBD per document. Therefore an update to a graph of data can impact 1..n documents.

Tripod maintains a transaction log (tlog) of updates to allow rollback in the case of multi-document writes. It is possible (and recommended) to run this on a separate cluster from your main data. For disaster recovery, you can use the tlog to replay transactions on top of a known-good backup.

In production we run a small second cluster in EC2 that stores up to 7 days of tlog; we periodically prune it and flush it to S3.
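
To make the idea concrete, here is a toy in-memory sketch of a logged multi-document update with rollback. It is not Tripod's implementation: the class ToyStore, its methods, and the log entry layout are all hypothetical, and plain PHP arrays stand in for the MongoDB collection and the tlog.

```php
<?php
// Toy illustration only. None of these names exist in Tripod.
class ToyStore
{
    public $docs = array(); // document id => document
    public $tlog = array(); // one entry per multi-document update

    // Apply $changes (document id => new document) as one logical update.
    // $failAfter simulates a crash after that many document writes.
    public function saveChanges(array $changes, $failAfter = null)
    {
        // 1. Log the "before" state of every affected document first.
        $before = array();
        foreach ($changes as $id => $doc) {
            $before[$id] = isset($this->docs[$id]) ? $this->docs[$id] : null;
        }
        $txId = count($this->tlog);
        $this->tlog[$txId] = array(
            'before' => $before,
            'after'  => $changes,
            'status' => 'in_progress',
        );

        // 2. Write documents one at a time: each write is atomic, the set is not.
        try {
            $written = 0;
            foreach ($changes as $id => $doc) {
                if ($failAfter !== null && $written >= $failAfter) {
                    throw new RuntimeException('simulated failure mid-update');
                }
                $this->docs[$id] = $doc;
                $written++;
            }
            $this->tlog[$txId]['status'] = 'completed';
            return true;
        } catch (RuntimeException $e) {
            // 3. Roll back by restoring the logged "before" state.
            foreach ($before as $id => $doc) {
                if ($doc === null) {
                    unset($this->docs[$id]);
                } else {
                    $this->docs[$id] = $doc;
                }
            }
            $this->tlog[$txId]['status'] = 'failed';
            return false;
        }
    }
}

$store = new ToyStore();
$store->saveChanges(array('user/1' => array('foaf:name' => 'Ann')));

// This update touches two documents and fails after the first write.
$ok = $store->saveChanges(
    array(
        'user/1' => array('foaf:name' => 'Anne'),
        'user/2' => array('foaf:name' => 'Bob'),
    ),
    1
);
var_dump($ok);
var_dump($store->docs);
```

In this toy model, replaying the 'after' state of each completed log entry over a restored copy of the documents would correspond to the disaster-recovery use described above.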

What have you built with this?

The majority of the datasets underpinning Talis Aspire, an enterprise SaaS course management system serving 1M students in over 50 universities worldwide, are powered using graph data stored in MongoDB via the Tripod library.

We built Tripod when we needed to migrate away from our own in-house proprietary triple store (incidentally built around early versions of Apache Jena).

We've been using it in production for 2 years. Our data volume is over 500M triples across 70 databases, running on modest 3-node clusters (2 data nodes plus an arbiter). The data nodes are Dell R710 mid-range servers with 12 cores, 96GB RAM, and a RAID-10 array of non-SSD disks; the arbiter is an m1.small instance in EC2.

Why would I use this?

When shouldn't I use this?

Some further limitations

Coming soon (aka a loose roadmap)

Presentations

We presented on an earlier version at MongoUK 2012. Since that time we have resolved the following todos:

Credits

We make use of the excellent ARC, and elements of Tripod are based on the Moriarty library, the fruit of some earlier work by Talis to provide a PHP library for Talis' own proprietary cloud triple store (no longer in operation).

The brainchild of kiyanwang and robotrobot @ Talis