Consider using a graph db for services

aexvir commented 4 years ago

One of our goals is to have The Zoo as the main source of truth for all our microservices. Our microservices interact with each other in many ways, and we need to represent that interaction.

Currently we have a really simple model hierarchy, mainly because our data is limited, but as we plan to add more and more data, it will be more complex to represent all the possible dependencies between them.

With a graph db we can have all the current services as nodes and use edges to represent the different interactions (requires, uses, belongs, etc) in a much more performant and optimized way, allowing "long shot" queries without having to do many "inner join" operations to find relations 5 layers deep.

I wouldn't like to model the whole system as a graph, as it doesn't bring that much value for the other resources that we have, and we'd be losing the ORM. But as Django allows using multiple databases, I'd move only the parts that benefit from being modelled as graphs there.

The disadvantage here would be that we'll have to make a thin layer to keep operations over the graph db consistent. That layer would probably just use the neo4j bolt driver. There are a couple of projects aiming to offer ORM-like interfaces, but all seem kind of abandoned.

Or, another option could be to use a multi-model db approach, like agensgraph, which supports both graph and relational models on top of PostgreSQL.

We'd also probably be losing the admin unless we do some work ourselves. Personally I don't use the admin that much, but as this is an OSS project this might be a bigger inconvenience.

aexvir commented 4 years ago

cc @maroshmka , as he was the one who first suggested it, for their possible work on data flow modelling.

Stranger6667 commented 4 years ago

What kind of queries are limiting us now, or could limit us at a larger scale? Have we considered tweaking these queries? Since you mention INNER JOIN in this context, I am thinking about recursive common table expressions or ltree, which are more typical in PG for working with graph-like data.

aexvir commented 4 years ago

We don't have them at the moment @Stranger6667 , as we are barely starting to map our infrastructure. But thanks for your point, I think it's definitely something to consider.

Queries I can imagine that we would like to be solving would be:

- services that could affect my service
- the reverse one, which services my service will affect when it's down?
- services/resources my service is depending on:
  - service a
  - service b
  - service z because of a
Considering that not all services are required for my service to work, even if we relate them together, we'd have either to keep a "flattened" dependency list on each service (which also seems to be the ltree approach), or query dependencies recursively, all while keeping in mind that what's required for my service doesn't have to be required for a service my service is depending on. And the dependency depth can get pretty wild.
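To make the "required vs. merely related" distinction concrete, here is a minimal plain-Python sketch. It assumes one possible reading of the paragraph above: each edge carries its own `required` flag, and only required edges are followed. The edge list, the service names and the `required_dependencies` helper are all hypothetical, not anything that exists in The Zoo.

```python
from collections import deque

# Hypothetical edge list: (service, dependency, required).
EDGES = [
    ("my-service", "a", True),
    ("my-service", "b", True),
    ("my-service", "metrics", False),  # related, but not needed to work
    ("a", "z", True),
    ("b", "cache", False),
]


def required_dependencies(root, edges):
    """Map each hard dependency of `root` to the service that pulled it in."""
    # Index only the required edges by their source service.
    outgoing = {}
    for source, target, required in edges:
        if required:
            outgoing.setdefault(source, []).append(target)

    # Breadth-first walk; the first service that reaches a dependency
    # is recorded as the "because of" reason.
    because_of = {}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for target in outgoing.get(current, []):
            if target != root and target not in because_of:
                because_of[target] = current
                queue.append(target)
    return because_of


deps = required_dependencies("my-service", EDGES)
```

Here `deps` maps `z` to `a`, which is the "service z because of a" case from the list above, while `metrics` and `cache` are left out because their edges are optional.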

And considering data flows, which @maroshmka can probably tell more about, they would like to model different actions over the data flow (extractions, transformations, usage, querying) that the different nodes would perform between each other. That again can also be represented using relational databases, but this is about discussing whether the convenience of using a graph db (standalone or on top of pgsql) is worth the effort.

JanBednarik commented 4 years ago

For the amount of data we can reasonably expect in The Zoo (far fewer than a million Services), having these graph-like relations in PostgreSQL should be fine.

In Django we can use ManyToMany relationships with through models representing edges. And with custom Manager methods, or some util functions generating QuerySets, you can simplify usage to the level of a graph database interface. I have been using Django and PostgreSQL for graph data before without any major issues.
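As a rough illustration of that shape, here is a plain-Python stand-in rather than actual Django code: `Edge` plays the role of a through-model row and `EdgeSet` plays the role of a custom Manager. Both names, their methods and the example services are hypothetical; a real version would be Django models and QuerySets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Stand-in for a through-model row: one typed, directed relation."""
    source: str
    target: str
    kind: str  # e.g. "requires", "uses", "belongs"


class EdgeSet:
    """Manager-like wrapper that hides edge filtering behind named methods."""

    def __init__(self, edges):
        self._edges = list(edges)

    def outgoing(self, service, kind=None):
        """Services that `service` points at, optionally limited to one kind."""
        return sorted(
            e.target for e in self._edges
            if e.source == service and kind in (None, e.kind)
        )

    def incoming(self, service, kind=None):
        """Services that point at `service`, optionally limited to one kind."""
        return sorted(
            e.source for e in self._edges
            if e.target == service and kind in (None, e.kind)
        )


edges = EdgeSet([
    Edge("booking", "payments", "requires"),
    Edge("booking", "search", "uses"),
    Edge("notifications", "booking", "requires"),
])
```

With this, `edges.outgoing("booking", "requires")` gives the hard dependencies of a service and `edges.incoming("booking")` gives the services one edge away that would be affected by it.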

Adding a graph database would come with the cost of added complexity and does not magically solve everything. I would consider adding a graph database, or other solutions, when we hit performance issues with PostgreSQL and Django ORM.

Stranger6667 commented 4 years ago

"services/resources my service is depending on: service a, service b, service z because of a"

yep, it seems like a job for a recursive CTE for depth > 1 (at depth one we can use simple joins; a CTE will still work there, though maybe less performantly). I assume that even for a big number of services, e.g. 1M, we can utilize index(-only) scans with fairly good performance in the recursive CTE part. However, it would be interesting to compare different options.
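A minimal sketch of such a recursive CTE, run through Python's `sqlite3` module so it works on its own. SQLite stands in for PostgreSQL here, and the `dependency` table, its columns and the sample rows are hypothetical. The depth limit also keeps the query from looping forever on cycles such as a -> z -> a.

```python
import sqlite3

# Hypothetical schema: one row per directed edge "source depends on target".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dependency (source TEXT NOT NULL, target TEXT NOT NULL);
    CREATE INDEX dependency_source ON dependency (source);
    INSERT INTO dependency VALUES
        ('my-service', 'a'), ('my-service', 'b'), ('a', 'z'), ('z', 'a');
""")

TRANSITIVE_DEPENDENCIES = """
    WITH RECURSIVE deps(name, depth) AS (
        -- anchor: direct dependencies of the root service
        SELECT target, 1 FROM dependency WHERE source = :root
        UNION
        -- recursive step: dependencies of what we have found so far
        SELECT d.target, deps.depth + 1
        FROM dependency AS d
        JOIN deps ON d.source = deps.name
        WHERE deps.depth < :max_depth
    )
    SELECT name, MIN(depth) FROM deps
    WHERE name != :root
    GROUP BY name
    ORDER BY 2, 1
"""


def transitive_dependencies(conn, root, max_depth=5):
    """Return (service, shortest depth) for everything `root` depends on."""
    rows = conn.execute(
        TRANSITIVE_DEPENDENCIES, {"root": root, "max_depth": max_depth}
    )
    return [(name, depth) for name, depth in rows]


result = transitive_dependencies(conn, "my-service")
```

Swapping `source` and `target` in the two SELECTs would give the reverse question, i.e. which services are affected when a given one is down.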

as @JanBednarik mentions, an intermediate table for representing the relation (requires, uses, etc) in an M2M relationship should be a really good fit.

For now, it seems to me that PG should still fit. I haven't worked with graph databases, so it is hard to say whether they would be convenient to use, but I expect the graph DB approach to take more effort in the long run than PG, mainly because of my lack of experience with specific graph DBs and my existing experience with graphs represented in PG.

maroshmka commented 4 years ago

hey guys, a lot of good points and questions here.

I generally agree with @JanBednarik, we shouldn't increase system complexity with a graph database when it is not needed.

Therefore I would do a case study that models a specific problem, and we then compare the solutions in terms of complexity, readability, robustness and so on. I would imagine something that meets these criteria (or a subset):

interactive visualisation in web UI
continuing with @aexvir's questions - the visualisation should answer these questions:
1. what everything do I affect that is 1 relationship long (1 edge)?
2. what everything affects me that is 1 relationship long (1 edge)?
3. what is shortest path from "service A" to "service B"?
4. what are all possible paths from "service A" to "service B" that are max X relationship long (X edges)?

Answering these questions would help by giving more visibility into what a service is gonna influence, e.g. in case of breaking changes, and could help in OPS.

I would start with the last one, as it is the most generic one, and we should be able to answer the other ones (with some small changes) with it.
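For comparison with a graph query, here is a minimal plain-Python sketch of that last question: all simple paths from one service to another that are at most X edges long. The adjacency dict, the service names and the `paths_up_to` helper are hypothetical, and the sketch follows edges in one direction only, which is an assumption about how "affects" would be modelled.

```python
# Hypothetical adjacency list: service -> services it directly affects.
GRAPH = {
    "service-a": ["gateway", "queue"],
    "gateway": ["service-b"],
    "queue": ["worker"],
    "worker": ["service-b"],
}


def paths_up_to(graph, start, goal, max_edges):
    """Yield every simple path from start to goal using at most max_edges edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            yield path
            continue
        # A path with n nodes has n - 1 edges; stop extending at the limit.
        if len(path) - 1 >= max_edges:
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in path:  # never revisit a node, so cycles are safe
                stack.append((neighbour, path + [neighbour]))


all_paths = sorted(paths_up_to(GRAPH, "service-a", "service-b", 3), key=len)
shortest = all_paths[0]
```

The first element after sorting by length answers the shortest-path question, and calling the helper with `max_edges=1` reduces it to the one-edge questions.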

My hypothesis is that the code itself would be less readable and less robust when using a relational database. In neo4j we could answer the question with a one-line query:

MATCH (mct:REST_API { name: 'mct-api' }), (blkn:REST_API { name: 'blkn-api' }), p = allShortestPaths((mct)-[*]-(blkn))
RETURN p

for reference

I don't wanna say it would be "better", that's why I'm suggesting a case study to compare the solutions on a very limited use-case that should be implementable in a matter of days for each design.

What do you think? (For sure, as an outsider in respect to this project, I understand if you decide to go with PG)

and @aexvir, regarding #194 - we're still not sure about this, but we will have an update soon.

JanBednarik commented 4 years ago

I can imagine that in The Zoo we will have a few use cases for getting data from this graph. If there are performance issues with a "naive" implementation in PostgreSQL, we can look at these use cases and try to optimize them. We can do some denormalization, use flexible data types like JSON, hstore or Array, and probably more. I would just try to implement it for the real use cases we have in The Zoo, as a proof of concept, and then we will see whether it's good enough or whether we need to adjust it, optimize it, or refactor it using different data types or a different database.

maroshmka commented 4 years ago

Just for the record - I was talking mostly about software qualities like robustness, openness to extensions and readability. Not about performance problems or any other hard technical problems, but software development ones.

kiwicom / the-zoo

Consider using a graph db for services #226