It would also be important to compare typical Neo4j usage for these queries.
You should look at OrientDB instead. It's a hybrid between a document store and a graph database. In most benchmarks, it is faster than Neo4J as well.
@eamonnmag As already mentioned in the discussion: it would imply using three different DBs: Postgres, Orient, and Elasticsearch. And, what's more important, IT will never agree to using Orient for records.
@kaplun The same goes for neo4j
@MSusik Here we are not discussing where to store things, but where relation-related queries are performed. If it turns out the best option is to have a separate slave DB where we expose the relations, then why not?
@kaplun Sure, I'm just trying to cool down the enthusiasm, as it is quite obvious that Orient or Arango will be an order of magnitude faster than Postgres and ES.
Thanks for setting up the issue. Looking forward to seeing the results.
I understand, I was responding to @kaplun on the suggestion to use Neo4j. To me, the simplest option is the best though. Neo4j is actually quite messy to use for more complex queries with their Cypher language. Having a standard Postgres database with a proper relational model, indexed with Elasticsearch, makes the most sense. It's also the simplest and most maintainable for those who work on this in the future. Too many technologies will just confuse things. Also, I don't even believe that Elasticsearch should be considered a 'competitor' to Postgres (if I understand your benchmarking). It serves a different purpose, and that purpose is certainly not being an ACID-compliant database (see https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). Let the secure, trusted databases carry the data, and let Elasticsearch hold an optimised index for your search use cases.
@eamonnmag The queries we are benchmarking here indeed do not require ACIDity, in the sense that they are only SELECT queries aggregating information, which could also be run on an Elasticsearch DB if that happens to be better suited to the particular job in terms of performance.
Here are some benchmark results I got from a random graph I built (50k records, 25k authors, 5k institutions, 2k keywords, 1.5M references and about 700k signatures). Results are averaged over 10 random authors for PostgreSQL+JSONB and over 1000 random authors for the relational approach. All results have been obtained on the same machine -- my laptop.
PostgreSQL, records as JSONB, GIN index:
Query 1 = 403.374 ms / author
Query 2 = 555.254 ms / author
Query 3 = 525.885 ms / author
Query 4 = 254.306 ms / author
Query 5 = 244.964 ms / author
Query 6 = 255.022 ms / author
Query 7 = 1063.657 ms / author
Query 8 = (never finished after minutes) / country
Full relational ORM-based approach, using models to represent records and their fields (Django* + PostgreSQL). These timings include the time required for building the model objects from the underlying queries. Parameters are all defaults, nothing has been tuned for performance:
Query 1 = 8.872 ms / author
Query 2 = 1.045 ms / author
Query 3 = 1.615 ms / author
Query 4 = 5.240 ms / author
Query 5 = 6.406 ms / author
Query 6 = 4.833 ms / author
Query 7 = 11.239 ms / author
Query 8 = 401.116 ms / country
*: We can expect the same results with SQLAlchemy, if not better. I used Django because it was faster for me to get something working without thinking too much :)
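For orientation, here is a minimal sketch of what such a relational model could look like when written with a recent SQLAlchemy (1.4+) instead of Django; class, table and column names are illustrative only, not the exact models used in the benchmark:

from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# many-to-many link between papers and authors ("signatures")
signatures = Table(
    "signatures", Base.metadata,
    Column("paper_id", ForeignKey("papers.id"), primary_key=True),
    Column("author_id", ForeignKey("authors.id"), primary_key=True),
)

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    papers = relationship("Paper", secondary=signatures, back_populates="authors")

class Paper(Base):
    __tablename__ = "papers"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    authors = relationship("Author", secondary=signatures, back_populates="papers")

With such a model, a query like number 6 essentially reduces to counting rows in the signatures table for a given author, which the database can answer directly from its indexes.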
:scream: OK, so assuming both benchmarks are producing correct results, it looks like the ORM is ~100x faster than JSONB for complex selects. For completeness it would be nice to also see Elasticsearch benchmarks. @glouppe can you share your generator that produced the JSON, so that it can be adapted to Elasticsearch? @tiborsimko mentioned he (or someone in IT) might help with that.
The data can be downloaded at https://drive.google.com/file/d/0B1kJ1OjakiU4NlFRUzl5VUR0bk0/view?usp=sharing It is in JSON, following the schema given above (except for the "citations" field which I renamed to "references", because that is the correct term).
Interesting numbers! What is the version of PostgreSQL used and the configuration? How were indexes created for those JSON columns? (Ditto in the ORM approach?) Have you tried to compare EXPLAIN SELECT ... on the queries? A certain performance penalty associated with doing "self-joins inside JSON" is naturally expected when compared to using the RDBMS engine -- but a performance penalty of ~100x seems a bit excessive... It would be good to understand it.
(BTW see for example this)
And yes, let's reproduce the same tests in Elasticsearch. We plan to do some progressive scalability tests over summer here in the West End.
What is the version of PostgreSQL used and the configuration?
PostgreSQL 9.4, default settings.
How were indexes created for those JSON columns?
CREATE INDEX idxginp ON records USING gin (record jsonb_path_ops);
(Ditto in the ORM approach?)
Nothing has been tuned for the ORM. No index has been set. Performance is measured as it is, out-of-the-box.
Here are all my scripts for the benchmarks https://github.com/glouppe/inspire-data-benchmarks -- see the notebooks inside.
And yes, let's reproduce the same tests in Elasticsearch. We plan to do some progressive scalability tests over summer here in the West End.
Just wondering, because I have never used it: is it possible to define these kinds of queries in ES in the first place? How would you implement query 1, for example?
(I am under the impression that ES is tailored for full-text searches, possibly with filters and so on, but not so much for such select queries. I may be wrong though.)
It's clear that the best solution here is the ORM in SQLAlchemy alongside Elasticsearch for queries. Benchmarking more will not tell you anything different. Moreover, as I said before, Elasticsearch should not even be compared with Postgres in these benchmarks; they are completely different systems.
So, my suggestion would be to use the ORM approach for direct DB insertion, storage, and querying (for construction of the index). Then, from the Postgres DB, create one or more indexes in Elasticsearch that cover your search use cases.
Here are all my scripts for the benchmarks https://github.com/glouppe/inspire-data-benchmarks -- see the notebooks inside.
Thanks. I'll finally be able to have a look once all the conferences are over.
Just wondering, because I have never used it: is it possible to define these kinds of queries in ES in the first place?
I agree that advanced grouping/joining may not be Elasticsearch's primary strength... However, there are several possible techniques: (1) note that the JSON used for indexing is not the same as the record-JSON that is stored; it is an enriched JSON, where important properties can be "expanded" so that Elasticsearch can work with them and provide faceting on them. Depending on the needs, we may be able to expand parts of the JSON in this way. (2) We may use the generic parent-child indexing capabilities, or (3) there is always the possibility to leave things for runtime and do it on the application side, as with the triple-nested refersto/citedby. It depends on the use case and the needed speed. As you suggested, we can take the above query 1 for comparison...
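To make (1) a bit more concrete, here is a rough, hedged sketch of how query 1 (coauthor counts for author 13) might be expressed against Elasticsearch, assuming the records are indexed into a "records" index with the authors expanded as a plain object array so that authors.id is filterable and aggregatable; all index and field names are illustrative:

import requests

body = {
    "size": 0,
    "query": {"term": {"authors.id": 13}},          # papers (co-)authored by author 13
    "aggs": {
        "coauthors": {
            "terms": {"field": "authors.id", "size": 1000}   # paper counts per author id
        }
    },
}
resp = requests.post("http://localhost:9200/records/_search", json=body)
buckets = resp.json()["aggregations"]["coauthors"]["buckets"]
# The bucket for author 13 itself has to be dropped on the application side,
# and the coauthor names resolved from the author records afterwards.
coauthors = [(b["key"], b["doc_count"]) for b in buckets if b["key"] != 13]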
It's clear that the best solution here is the ORM in SQLAlchemy alongside Elasticsearch for queries.
I'm not sure whether it is the best overall... E.g. what about insertion speed? The ORM approach will have to execute many SQL queries when ingesting new records. We should not forget to profile this task as well, especially as database tables grow. (Telling from experience, because the ORM approach is similar to "bibxxx-on-steroids" approach that we used in Invenio v1.x legacy times, where we were also penalised for having to do 50+ inserts for a record...)
Have you tried to compare EXPLAIN SELECT ... on the queries?
... so I have finally got to looking at this now that all the conferences are over.
I took query no. 4 as an example, because it is a simple GROUP BY one and the basis for the other, more complex queries.
I loaded your example file bench-publications.json.
Here are the results of EXPLAIN SELECT immediately after loading:
=> EXPLAIN ANALYZE SELECT keyword, count(keyword)
FROM records r, jsonb_array_elements_text(r.record->'keywords') keyword
WHERE r.record->'authors' @> '[{"id": 13}]' group by keyword;
QUERY PLAN
HashAggregate (cost=8435.00..8436.00 rows=100 width=32) (actual time=615.266..615.285 rows=63 loops=1)
Group Key: keyword.value
-> Nested Loop (cost=0.01..8410.00 rows=5000 width=32) (actual time=26.668..615.130 rows=93 loops=1)
-> Seq Scan on records r (cost=0.00..8310.00 rows=50 width=1160) (actual time=26.618..614.582 rows=21 loops=1)
Filter: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
Rows Removed by Filter: 49979
-> Function Scan on jsonb_array_elements_text keyword (cost=0.01..1.00 rows=100 width=32) (actual time=0.021..0.021 rows=4 loops=21)
Planning time: 0.151 ms
Execution time: 615.371 ms
The value of 615 ms has roughly the same order of magnitude as your observation of 254 ms. (I may have an older laptop with more processes running in parallel, you may have used several different "random" test records, etc.)
In the output above, note that 49979 out of 50000 rows were "removed by filter", which suggests that something like a full table scan has been happening... This would explain the slowness, which is understandable given that no suitable index had been created yet.
After executing your general index creation:
CREATE INDEX idxginp ON records USING gin (record jsonb_path_ops);
the situation does not really change much... Ditto for a non-jsonb_path_ops index as well.
However, if one adds a dedicated helper index on authors:
CREATE INDEX idxginauthors ON records USING gin ((record -> 'authors'));
then the situation looks much better:
=> EXPLAIN ANALYZE SELECT keyword, count(keyword)
FROM records r, jsonb_array_elements_text(r.record->'keywords') keyword
WHERE r.record->'authors' @> '[{"id": 13}]' group by keyword;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=341.95..342.95 rows=100 width=32) (actual time=5.151..5.168 rows=63 loops=1)
Group Key: keyword.value
-> Nested Loop (cost=28.39..316.95 rows=5000 width=32) (actual time=1.542..5.078 rows=93 loops=1)
-> Bitmap Heap Scan on records r (cost=28.39..216.94 rows=50 width=1160) (actual time=1.506..4.722 rows=21 loops=1)
Recheck Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
Rows Removed by Index Recheck: 106
Heap Blocks: exact=127
-> Bitmap Index Scan on idxgin6 (cost=0.00..28.38 rows=50 width=0) (actual time=0.926..0.926 rows=127 loops=1)
Index Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
-> Function Scan on jsonb_array_elements_text keyword (cost=0.01..1.00 rows=100 width=32) (actual time=0.014..0.015 rows=4 loops=21)
Planning time: 0.447 ms
Execution time: 5.272 ms
The bitmap index can now be used, and we reach query times of 5 ms instead of 615 ms.
@glouppe Can you please recheck? It seems probable that a full table scan is being performed for most of the above JSON queries, which may explain the huge speed difference you have seen... They shouldn't really be necessary in most situations.
@glouppe Can you please recheck? It seems probable that a full table scan is being performed for most of the above JSON queries, which may explain the huge speed difference you have seen...
Yes, I suspected the same. There may be one or several scans needed for some of the queries.
I'm not sure whether it is the best overall... E.g. what about insertion speed? The ORM approach will have to execute many SQL queries when ingesting new records. We should not forget to profile this task as well, especially as database tables grow. (Telling from experience, because the ORM approach is similar to "bibxxx-on-steroids" approach that we used in Invenio v1.x legacy times, where we were also penalised for having to do 50+ inserts for a record...)
Yes, this is true. In general, insertion will be slower. However, this can be mitigated by creating all related objects in bulk, eventually resulting in one or a few INSERTs, as sketched below. In any case, we should also weigh the number of times a record is created/updated against the number of times it is selected.
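For example, with SQLAlchemy Core such batching could look roughly like this (connection string, table and column names are purely illustrative, not the actual Inspire setup):

from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine

engine = create_engine("postgresql://localhost/inspire_bench")  # illustrative
metadata = MetaData()
keywords = Table(
    "record_keywords", metadata,
    Column("record_id", Integer),
    Column("keyword", String),
)
metadata.create_all(engine)

rows = [{"record_id": 42, "keyword": kw} for kw in ("lattice", "qcd", "monte carlo")]
with engine.begin() as conn:
    # one executemany() round trip instead of N separate INSERT statements
    conn.execute(keywords.insert(), rows)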
In addition, while performance is important, this should not be the sole factor in this decision. Leveraging an existing and mature technology such as SQLAlchemy allows for faster development time, robustness, correctness*, and much easier adoption.
*: Homework: some of the PostgreSQL queries above are wrong. Find them and correct them.
I'm not sure whether it is the best overall... E.g. what about insertion speed? The ORM approach will have to execute many SQL queries when ingesting new records. We should not forget to profile this task as well, especially as database tables grow. (Telling from experience, because the ORM approach is similar to "bibxxx-on-steroids" approach that we used in Invenio v1.x legacy times, where we were also penalised for having to do 50+ inserts for a record...)
This is very true. I am hoping SQLAlchemy natively supports what we ended up implementing in bibupload, i.e. the ability to fire SQL statements only for modified values. @glouppe @eamonnmag?
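(If I understand correctly, the SQLAlchemy unit of work does track per-attribute changes, so a flush only emits UPDATEs for the columns that were actually modified; a hedged sketch with a recent SQLAlchemy (1.4+), using illustrative model names:)

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Author(Base):
    __tablename__ = "authors"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    affiliation = Column(String)

engine = create_engine("sqlite://", echo=True)  # echo=True prints the emitted SQL
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Author(id=13, name="Doe, J.", affiliation="CERN"))
    session.commit()

    author = session.get(Author, 13)
    author.name = "Doe, John"   # only 'name' is marked dirty
    session.commit()            # the echoed UPDATE touches only the 'name' column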
In addition, while performance is important, this should not be the sole factor in this decision. Leveraging an existing and mature technology such as SQLAlchemy allows for faster development time, robustness, correctness*, and much easier adoption.
A bit OT but :+1: (of course, provided we don't discover little by little that in fact the ORM is 100 times slower than JSONB in the real case. If they end up comparable in performance, then IMHO it's still OK).
Still OT for this particular GitHub issue: I would also add that the ORM gives the non-benchmarkable peace of mind of integrated relation support (avoiding having to handle most of the synchronization issues by hand), plus it supports flask(-super)-admin while still retaining the possibility of JSONSchema (by deriving schemas from the models).
Homework: some of the PostgreSQL queries above are wrong. Find them and correct them.
Note that the queries may perhaps be OK -- even though I'd consider possibly moving away and/or cloning some referential fields, such as the record ID, into a separate ORM column, as done in my prior study.
Even without rewriting the queries too much, we can achieve better performance by surrounding means.
For example, in my query 4 analysis above, I simply created a new appropriate index, which resulted in a ~100x speed improvement (5 ms instead of 615 ms).
As another example, let me take query 1 now. You were interested in how this query might work in Elasticsearch world, but let's leave that for later, and let's see if we can speed it up on PostgreSQL/JSONB already.
Here is vanilla EXPLAIN SELECT for query 1:
=> EXPLAIN ANALYZE SELECT r.record->>'name', pid, number FROM records r, LATERAL (
SELECT o->>'id' as pid, count(o->>'id') as number FROM records r,
jsonb_array_elements(r.record->'authors') o
WHERE record->'authors' @> '[{"id": 13}]' group by o->>'id') c
WHERE r.record->>'id' = pid and pid != '13';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=394.41..9013.79 rows=24750 width=1200) (actual time=8.545..195.512 rows=378 loops=1)
Hash Cond: ((r.record ->> 'id'::text) = ((o.value ->> 'id'::text)))
-> Seq Scan on records r (cost=0.00..8060.00 rows=50000 width=1160) (actual time=0.008..13.130 rows=50000 loops=1)
-> Hash (cost=393.17..393.17 rows=99 width=40) (actual time=8.421..8.421 rows=378 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> HashAggregate (cost=390.95..392.18 rows=99 width=32) (actual time=8.067..8.224 rows=378 loops=1)
Group Key: (o.value ->> 'id'::text)
-> Nested Loop (cost=28.39..353.82 rows=4950 width=32) (actual time=1.871..7.696 rows=383 loops=1)
-> Bitmap Heap Scan on records r_1 (cost=28.39..216.94 rows=50 width=1160) (actual time=1.786..6.200 rows=21 loops=1)
Recheck Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
Rows Removed by Index Recheck: 106
Heap Blocks: exact=127
-> Bitmap Index Scan on idx0 (cost=0.00..28.38 rows=50 width=0) (actual time=1.166..1.166 rows=127 loops=1)
Index Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
-> Function Scan on jsonb_array_elements o (cost=0.01..1.50 rows=99 width=32) (actual time=0.030..0.053 rows=18 loops=21)
Filter: ((value ->> 'id'::text) <> '13'::text)
Rows Removed by Filter: 6
Planning time: 0.480 ms
Execution time: 195.661 ms
One can see two things here: (1) the inner loop uses the bitmap heap scan and the index, hence it is pretty efficient; but (2) the outer loop uses a full sequential table scan, going through 50k rows, which seems to account for the bad performance.
It seems that the PostgreSQL query planner couldn't efficiently combine the query parts out of the box by itself. (Or perhaps it simply considered that, given this relatively small row count (50k), the hash join combination is sufficient... and I cannot really blame it, because 195 ms seems like an acceptable speed.)
If we want to be quicker, let's try to advise PostgreSQL to combine the queries differently, without rewriting them too much. E.g., as a quick test, let's forbid hash joins globally:
=> SET enable_hashjoin = false;
which gives:
=> EXPLAIN ANALYZE SELECT r.record->>'name', pid, number FROM records r, LATERAL (
SELECT o->>'id' as pid, count(o->>'id') as number FROM records r,
jsonb_array_elements(r.record->'authors') o
WHERE record->'authors' @> '[{"id": 13}]' group by o->>'id') c
WHERE r.record->>'id' = pid and pid != '13';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=396.17..29774.59 rows=24750 width=1200) (actual time=8.059..17.757 rows=378 loops=1)
-> HashAggregate (cost=390.95..392.18 rows=99 width=32) (actual time=7.944..8.118 rows=378 loops=1)
Group Key: (o.value ->> 'id'::text)
-> Nested Loop (cost=28.39..353.82 rows=4950 width=32) (actual time=1.744..7.593 rows=383 loops=1)
-> Bitmap Heap Scan on records r_1 (cost=28.39..216.94 rows=50 width=1160) (actual time=1.662..6.064 rows=21 loops=1)
Recheck Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
Rows Removed by Index Recheck: 106
Heap Blocks: exact=127
-> Bitmap Index Scan on idx0 (cost=0.00..28.38 rows=50 width=0) (actual time=1.075..1.075 rows=127 loops=1)
Index Cond: ((record -> 'authors'::text) @> '[{"id": 13}]'::jsonb)
-> Function Scan on jsonb_array_elements o (cost=0.01..1.50 rows=99 width=32) (actual time=0.031..0.054 rows=18 loops=21)
Filter: ((value ->> 'id'::text) <> '13'::text)
Rows Removed by Filter: 6
-> Bitmap Heap Scan on records r (cost=5.22..293.66 rows=250 width=1160) (actual time=0.016..0.016 rows=1 loops=378)
Recheck Cond: ((record ->> 'id'::text) = ((o.value ->> 'id'::text)))
Heap Blocks: exact=378
-> Bitmap Index Scan on idx2 (cost=0.00..5.16 rows=250 width=0) (actual time=0.014..0.014 rows=1 loops=378)
Index Cond: ((record ->> 'id'::text) = ((o.value ->> 'id'::text)))
Planning time: 0.591 ms
Execution time: 17.978 ms
In this way, we get a nested loop instead of a hash join, which is indeed efficient and leads to a ~10x speed improvement (18 ms instead of 195 ms).
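(If one does not want to flip the setting globally, the same hint can be scoped to a single transaction with SET LOCAL; a small sketch via psycopg2, with an illustrative database name:)

import psycopg2

conn = psycopg2.connect(dbname="inspire_bench")  # illustrative connection
with conn, conn.cursor() as cur:
    # SET LOCAL only affects the current transaction and reverts at COMMIT/ROLLBACK
    cur.execute("SET LOCAL enable_hashjoin = false;")
    cur.execute("""
        SELECT r.record->>'name', pid, number FROM records r, LATERAL (
            SELECT o->>'id' AS pid, count(o->>'id') AS number FROM records r,
                 jsonb_array_elements(r.record->'authors') o
            WHERE record->'authors' @> '[{"id": 13}]' GROUP BY o->>'id') c
        WHERE r.record->>'id' = pid AND pid != '13';
    """)
    coauthors = cur.fetchall()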
Seems to me that the nested JSONB query speed is pretty acceptable...
In addition, while performance is important, this should not be the sole factor in this decision.
Fully agreed. I'm commenting here solely about benchmarks, because the issue is about benchmarking... Some non-performance-related considerations have been touched upon in the past, e.g. at two Invenio Developer Fora, on 2015-05-04 and 2015-06-01.
In this issue the results of benchmarking will be announced.
Benchmark
The benchmark will consist of a set of queries to be run on the databases. Each database will contain the simplified records. The structure of the records is as follows:
Paper record
Author record
Affiliation record
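To give a rough idea, a simplified paper record could look more or less like the following (an illustrative sketch only; the exact fields are those defined by the dataset):

paper_record = {
    "id": 42,
    "title": "Some paper title",
    "authors": [
        {"id": 13, "name": "Doe, J.", "affiliations": [{"id": 7}]},
        {"id": 21, "name": "Smith, A.", "affiliations": [{"id": 9}]},
    ],
    "keywords": ["lattice", "qcd"],
    "references": [17, 23, 35],  # ids of the cited paper records
}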
Queries
1). Get names and IDs of coauthors and, for each of them, the number of papers written by them together with the given author (in other words, render the coauthors box on an author profile).
2). Get name variants and the number of their appearances on the papers of a given author (in other words, render the name variants box on an author profile).
3). Get affiliations, their IDs, and the number of papers written by a given author while the author was associated with that affiliation (in other words, render the affiliations box on an author profile).
4). Get keywords and the number of their appearances on the papers of a given author (in other words, render the keywords box on an author profile).
5). Get the number of papers and the number of single-authored papers for every category of papers for a given author (in other words, render the papers box on an author profile, without the sums over all categories).
6). Get the number of all papers and the number of all single-authored papers for a given author.
7). Get all citations which cite a given author's papers (the IDs of both papers involved).
8). Get the number of papers that cite papers written by authors who are affiliated with an institution from a given country (one author from an institution in the given country is enough for all the references to be counted in the result).
Dataset
TBA
Machine used
TBA
PostgreSQL setup
TBA
PostgreSQL JSONB queries
There is a "records" table with a "record" JSONB column. The input for the queries is the author with id 13, and France for the affiliations (query nr 8).
1).
2).
3).
4).
5).
6).
7).
8). (needs improvements)
Please note that some of these queries might be improved. I confirm that the queries above work on my machine and give the expected results.
Elasticsearch setup
TBA (relations will be resolved before pushing to db!)
Elasticsearch queries
TBA
Other things
Thanks to @dset0x for helping me with writing some of the JSONB queries.