Open scripting opened 7 years ago
What's the target data volume like? Does it need to scale out, or is it sufficient to run on a single node? What sort of traversals are you looking for? Get by id, and get users linked to this user by this label? Anything more?
As the author of a graph database in a previous life, definitely go with a graph database. Depending on whether this is going to be a personal graph ("Dave and his friends") or a global graph ("Facebook") or what exactly in between, and whether it has to be transactional across sharded servers or not, the exact projects/products and their management overhead vary dramatically. In the simplest case, simply define the API that you want and serialize to JSON in the file system, using the REST-ful ID of the object as the filename -- or use a key-value store.
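For illustration, the "JSON in the file system" option might look roughly like the sketch below. The `openStore` helper, its method names, and the shape of the user object (`id`, `attrs`, `links`) are all made up for this example and are not from any library; it only assumes Node's built-in `fs`, `os`, and `path` modules.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical store: one JSON file per user, named after the user's ID.
function openStore(dir) {
  fs.mkdirSync(dir, { recursive: true });
  // encodeURIComponent keeps IDs containing slashes from escaping the directory
  const fileFor = id => path.join(dir, encodeURIComponent(id) + '.json');
  return {
    putUser(user) {
      fs.writeFileSync(fileFor(user.id), JSON.stringify(user, null, 2));
    },
    getUser(id) {
      const file = fileFor(id);
      return fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : null;
    },
    // IDs of the users that `id` links to with the given label
    linked(id, label) {
      const user = this.getUser(id);
      if (!user) return [];
      return user.links.filter(l => l.label === label).map(l => l.to);
    }
  };
}

// Usage: a throwaway directory under the system temp folder
const store = openStore(fs.mkdtempSync(path.join(os.tmpdir(), 'graph-')));
store.putUser({ id: 'dave', attrs: { name: 'Dave' }, links: [{ label: 'friend', to: 'jb' }] });
store.putUser({ id: 'jb', attrs: { name: 'JB' }, links: [] });
console.log(store.linked('dave', 'friend'));
```

This covers "get by id" and "get users linked to this user by this label", but nothing more; reverse lookups (who links to me?) would need a scan of every file or a separate index.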
It depends on the scale of data you're looking to hold, but I've done something similar with LevelDB (well, actually RocksDB, which is based on LevelDB, but some limited googling makes me think that LevelDB has better NodeJS support (e.g., https://github.com/Level/levelup)).
It's a key-value store, but it's easy enough to model SVO triples as S:V -> O, V:O -> S, etc. depending on the query needs.
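A rough sketch of that key layout, using a plain `Map` as a stand-in for the key-value store so it runs without dependencies. The key prefixes (`sv:`, `vo:`), the function names, and the assumption that IDs never contain `:` are all invented for the example; with LevelDB or RocksDB the prefix scan would be a key-range iterator rather than a walk over every key.

```javascript
// Stand-in for a sorted key-value store such as LevelDB; a plain Map is
// used here so the sketch runs without any dependency.
const kv = new Map();

// Write each triple under two keys so it can be looked up from either side.
// Assumes subjects, verbs and objects never contain ':'.
function addTriple(s, v, o) {
  kv.set(`sv:${s}:${v}:${o}`, o); // S:V -> O
  kv.set(`vo:${v}:${o}:${s}`, s); // V:O -> S
}

// Prefix scan; a real store would use a key-range iterator instead of
// walking every key.
function scan(prefix) {
  const out = [];
  for (const [key, value] of kv) {
    if (key.startsWith(prefix)) out.push(value);
  }
  return out;
}

const objects = (s, v) => scan(`sv:${s}:${v}:`);  // whom does s <v>?
const subjects = (v, o) => scan(`vo:${v}:${o}:`); // who <v>s o?

addTriple('dave', 'follows', 'jb');
addTriple('dave', 'follows', 'ashic');
addTriple('ashic', 'follows', 'jb');

console.log(objects('dave', 'follows'));
console.log(subjects('follows', 'jb'));
```

Each extra index costs one more write per triple, which is why the set of indexes depends on the query needs.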
Cayley (https://github.com/cayleygraph/cayley) looks like a really interesting option, but I don't know anything about its NodeJS story.
I'll try to answer the questions asked in order --
I don't know how big the app will get. That's how these things go. Most of my apps don't grow that big. But you want to know that if it does grow, the upgrade path is smooth. So I'd say for now running on a single instance is fine. But I wouldn't want to rule out having a database that's spread out over many machines.
I've done the JSON files in a folder approach many times. I want something higher level with good browsing and debugging tools available. I had the ideal environment in Frontier's object database, but that's a Mac OS and Windows thing, not Unix. If I did a folder of JSON files I'd be starting over completely from scratch. Part of the reason I asked the question was to see if I could build on other people's work or if I'd have to start from scratch.
JB, thank you. That's exactly the kind of advice I'm looking for. Specific products that people have used that I can try out.
If single node is ok, I'd highly recommend Neo4j. When going distributed, things get tricky, and it's often enough to implement graph semantics on top of distributed databases like Cassandra (which had some graph functionality done via Tinkerpop, before DataStax shifted all graph focus to DSE). There are some other "native" graph features in stores like ArangoDB, OrientDB, etc.
Neo4J has been at it for a long time though, and it's probably the most robust implementation out there. If your payloads are small, or if you store them separately in a different store, then Neo4J can go a long way.
InterPlanetary File System? https://hackpad.com/ep/pad/static/C3MkNagVKqB Haven't looked around much…
I know this doesn't fit what you need now, but for the future, you might want to consider some form of blockchain distributed database, with private/public pairs keeping things secure for people.... just something to consider for later.
If this really is a social graph then I'd strongly argue for looking into a graph database. The one thing about graph databases that I've found is that non-mainstream storage engines often have usability issues around the operational side. Persistent storage is hard and large-scale usage is what gets the bugs out. None of the graph databases that I'm aware of, save Neo4J, actually has wide-scale usage along the lines of the SQL databases or even the NoSQL databases.
Now, that said, a graph database should support the type of neighbor style queries you generally want to do for a social graph. Neo4J is one which has been around for a while. I generally opt out of it though due to its Java orientation (not a Java fan). Cayley looks to be fairly interesting in that it seems to be a JavaScript query engine which maps the underlying storage onto other types of storage (LevelDB, Mongo, etc). It's also written in Go, which tends to be small, fast and easy to deploy.
Side Note: Cayley passed my open source engineering practices analyzer with an A grade (I built this for Ruby but it works for most open source tools).
Given your focus on JavaScript as a development tool, both Neo4J and Cayley seem to be viable since both support JavaScript for queries. Cayley seems to be a pure open source project while Neo4J is licensed software. There's no pricing info on the Neo4J site which always says to me "Expensive; if we put the pricing out there publicly, you'd never bother to call us".
@fuzzygroup Neo4J community edition is GPL. The enterprise edition is AGPL v3 for open source projects. The latter is also available under a commercial license for commercial projects.
Some thoughts...
One of the things I would hope for would be an interactive database viewer and editor.
I took a look at Neo4J, thinking I could read something to get a quick overview, but their website is a maze of neat net tricks meant to make the first-use experience easy, and it ended up complicated in a corporate way.
It seems to me if it's designed to store a graph, the hello world app should store a graph and perhaps do some queries and an update or two. And since JavaScript is such a popular language, it would be in simple JavaScript.
Also I was surprised that so far no one has recommended Redis. I thought that would have been the most popular choice. I was going to start implementing there before I thought to ask the brain trust what they would do.
So far this has been very illuminating and helpful. It's one of the benefits of running Scripting News, that I can ask questions like this and so quickly get back thoughtful and useful responses.
Keep diggin! :-)
I've been experimenting with ArangoDB for a little while, and I really, really like it. It advertises itself as a "multi-model" database -- this includes the ability to model data as a graph. Their training materials are very good. I highly recommend their Graph Course for Freshers to understand the graph database capabilities.
Okay, I just did a Google search, so that shows my fantastic expertise in graph databases. 8-)
Found this presentation : "Redis Graph, A graph database built on top of Redis" [PDF] that can be of interest.
https://github.com/swilly22/redis-graph by @swilly22
Hey Dave - At least for me the reason I didn't suggest Redis was that graph databases are such a specialized commodity even today. If you had asked about what NoSQL store to use or what Key Value store to use then Redis would have been top of the list for sure. Now, that said, the redis-graph presentation above is very, very interesting. If you want to go the Redis route then I would highly recommend Redis in Action by Josiah Carlson (Manning), which has a chapter on Building a Simple Social Network.
Ashic - thank you for the comment on the Community Edition. The differences here: https://neo4j.com/editions/ are interesting in that the community edition lacks "Unlimited Graph Size" which suggests that there are low level technical differences in the community edition that limit what can be done with it. Neo4J may be a fantastic product but personally I wouldn't touch any type of database with a capacity limit without that limit being absolutely understood. You don't see Postgres for example saying "Only 25 tables w/ 1000 rows" (and even that is at least understandable; all the Neo4J site says is that the community edition lacks "unlimited graph size"). And I wouldn't let the AGPL anywhere near my code at least until it has been legally tested.
If you can make do with running on a single instance I would also consider Realm: https://realm.io/docs/javascript/latest/
It is not a graph database as such, but closer to a real object database, which at least in my use cases has been great to model things very similar to social graphs. There is something amazing about just working with regular objects that obviously can reference other objects, and having that all be scalable, persistent and queryable without effort.
It is crazily fast and just a pleasure to work with in node. The only drawback I have found is that you need the commercial version to share Realms across machines (for which they do some fancy live sync, which is actually really cool, but looks pretty expensive).
"One of the things I would hope for would be an interactive database viewer and editor."
Realm does have a pretty nice viewer and editor: https://github.com/realm/realm-browser-osx
@scripting,
I would suggest you take a look at Virtuoso [1], a multi-model relational DBMS that supports structured data represented as relational tables and/or property graphs. It supports both SQL and SPARQL query languages for Data Definition and Data Manipulation operations.
Virtuoso is the DBMS behind DBpedia [2][3]. It is what I use for stuff like nanotations [4] etc.
[1] https://github.com/openlink/virtuoso-opensource
[2] http://dbpedia.org
[3] http://dbpedia.org/sparql
[4] https://kidehen.blogspot.com/2014/07/nanotation.html
[5] https://www.youtube.com/watch?v=aWvYQ338iiM -- Simple Nanotation demo
Happy to answer any questions you might have. Ditto getting you going pronto :)
Note that you can also get going with the commercial edition, using a conventional software installer for Mac OS X, Linux, or Windows.
@fuzzygroup The Community Edition "size limitation" is that it runs single node (no clustering). There's no limitation on how many vertices, edges, etc. are in your graph on that one node (AFAIK).
This question reminded me of a great post from Sarah Mei about attempting to use a NoSQL DB to build a social network and realizing that for any relatively complex system with a complex data model you very quickly start missing SQL and a relational database.
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
The article title calls out MongoDB but it's really about NoSQL / document / graph databases vs traditional relational DBs. And I agree with her conclusions from my own experience.
@galori I've found graph dbs to be useful for quite a few use cases. That said, two of the most in depth graph use cases were not achievable with the graph dbs I'd used before, or new ones we spiked. In one case, the data volumes were reasonable enough to fit on one modest machine, and in the end we used a focused schema in Postgres, with more complex traversals done via stored procedures (yes, I know... but the traversal queries were so common that having applications submit them was pointless, and sprocs meant we could improve them centrally). It worked really well. The graphs were stored as vertices with edge lists, with json payloads.
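As a rough picture of the "vertices with edge lists, with json payloads" layout, here is a minimal in-memory sketch. It is not the schema or the stored procedures described above: the `vertices` map stands in for table rows, and the `reachable` function, the sample data, and the field names (`payload`, `edges`, `label`, `to`) are all invented for the example.

```javascript
// Each vertex carries a JSON payload plus its outgoing edge list.
// In a relational setup these would be table rows; here, a Map keyed by ID.
const vertices = new Map([
  ['1', { payload: { name: 'John' },  edges: [{ label: 'friend', to: '2' }, { label: 'colleague', to: '3' }] }],
  ['2', { payload: { name: 'David' }, edges: [{ label: 'friend', to: '4' }] }],
  ['3', { payload: { name: 'Sarah' }, edges: [{ label: 'colleague', to: '1' }] }],
  ['4', { payload: { name: 'Jeff' },  edges: [] }],
]);

// Breadth-first traversal: IDs reachable from `start` within `maxDepth`
// hops, following only edges with the given label.
function reachable(start, label, maxDepth) {
  const seen = new Set([start]);
  let frontier = [start];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const id of frontier) {
      const vertex = vertices.get(id);
      if (!vertex) continue; // dangling edge: target vertex is missing
      for (const edge of vertex.edges) {
        if (edge.label === label && !seen.has(edge.to)) {
          seen.add(edge.to);
          next.push(edge.to);
        }
      }
    }
    frontier = next;
  }
  seen.delete(start); // the start vertex is not its own neighbour
  return [...seen];
}

console.log(reachable('1', 'friend', 1)); // direct friends
console.log(reachable('1', 'friend', 2)); // friends and friends of friends
```

Keeping a traversal like this in one central place, as the stored procedures did, means it can be tuned without touching every application that calls it.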
In another case though, data volumes were much higher. In the end, we had a very focused Cassandra store to deal with the scale out. Having spiked various graph db implementations, we knew what our queries were, and this translated to an achievable Cassandra schema. Of course, it wasn't flexible in terms of running ad-hoc queries, but the use cases were met. If more flexible querying was required, we'd pump dumps to parquet / subset to SQL and use Spark / SQL to analyse.
"It seems to me if it's designed to store a graph, the hello world app should store a graph and perhaps do some queries and an update or two. And since JavaScript is such a popular language, it would be in simple JavaScript."
I made a small example of how to do a (super minimal) social network with Realm in node:
const Realm = require('realm');

// Define model with users and labelled links between them
const UserSchema = {
  name: 'User',
  primaryKey: 'id',
  properties: {
    id: 'string',
    name: 'string',
    age: 'int',
    links: {type: 'list', objectType: 'Link'}
  }
};

const LinkSchema = {
  name: 'Link',
  properties: {
    label: 'string',
    user: 'User',
  }
};

// Open a Realm to contain the social network
Realm.open({schema: [UserSchema, LinkSchema]}).then(realm => {
  // create some users
  realm.write(() => {
    if (!realm.empty) return; // only add users on first run
    const john = realm.create('User', {id: '1', name: "John", age: 40, links: []});
    const david = realm.create('User', {id: '2', name: "David", age: 30, links: []});
    const sarah = realm.create('User', {id: '3', name: "Sarah", age: 42, links: []});
    const jeff = realm.create('User', {id: '4', name: "Jeff", age: 25, links: []});

    // add some relations between the users
    john.links.push({label: "friend", user: david});
    john.links.push({label: "colleague", user: sarah});
    john.links.push({label: "friend", user: jeff});
    sarah.links.push({label: "colleague", user: john});
  });

  // lookup a user by id
  const john = realm.objectForPrimaryKey('User', '1');
  console.log("All relations:", john.links, '\n');

  // find all friends older than 25
  const friendsOver25 = john.links.filtered("label = 'friend' AND user.age > 25");
  console.log("Friends over 25:", friendsOver25);
});
I would imagine the simplest, most production ready, scale ready setup would be based on traditional relational SQL - for a couple reasons:
So, I would propose either CockroachDB or one of its hosted brethren like AWS' Aurora or Google's Cloud Spanner with UUID v4 based identifiers prefixed with region identifiers. DBAs would aim to contain logical region identifiers within localized datacenters, but in the early days this could all live in a single region without issues (or even on a single server for the -very- early days). If operational skill was lacking in the startup team, I would simply go with MariaDB or Galera Cluster and hire a DBA once scaling the database becomes an issue (should be well after the company is making at least some money).
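One way the region-prefixed UUID v4 identifiers could look, as a sketch only. The `<region>:<uuid>` format, the `:` separator, the region name, and the helper names `newId` and `regionOf` are assumptions made for this example; the only real API used is `crypto.randomUUID()` from Node's standard library (Node 14.17 or later).

```javascript
const { randomUUID } = require('crypto'); // built in since Node 14.17

// Hypothetical ID format: "<region>:<uuid-v4>".
function newId(region) {
  if (region.includes(':')) throw new Error('region must not contain ":"');
  return `${region}:${randomUUID()}`;
}

// Recover the region from an ID so the application can route the
// request to the datacenter (or shard) that owns the row.
function regionOf(id) {
  const sep = id.indexOf(':');
  if (sep < 0) throw new Error(`not a region-prefixed id: ${id}`);
  return id.slice(0, sep);
}

const id = newId('us-east');
console.log(id, '->', regionOf(id));
```

Because the region is readable from the ID itself, routing needs no lookup table, and in the early days every ID can simply carry the same single region.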
For presence and chat, the story changes pretty dramatically - so only looking at recreating Facebook, not Messenger.
Facebook's costs seem low to many - but they had to write a huge number of patches to MySQL/MariaDB - rewrite practically all of PHP, contribute quite a bit of network driver fixes, write their own memcached proxy, etc.
With NodeJS (or any asyncio language here - Python 3, Elixir, Hack, Rust, etc.) / CockroachDB / Redis Cluster v4 / RabbitMQ / XMPP, the vast majority of Facebook's stack can be built out of open-source tech these days.
I don't work at Facebook, but judging by their MySQL patches, I would guess the vast majority of their transactions still occur within a traditional SQL tool - they just do a very excellent job of app-level sharding (like the region sharding mentioned above).
I like to ask open-ended technical questions here on GitHub. ;-)
Suppose I wanted to create a server that implemented a social graph.
Every user has an ID and a set of attributes. And links to other users with labels on the arcs.
The database should be fast and open source, run on Unix, be easily accessible from Node, support huge structures, be accessible over the net, and be widely deployed, debugged and stable (not changing).
What would you use if you were going to implement a social graph?