inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

RFC: PostgreSQL support, jsonalchemy query object #2020

Closed MSusik closed 8 years ago

MSusik commented 10 years ago

See also the bottom of this post for a second issue.

Improving queries on PostgreSQL records.

With the introduction of version 9.4, PostgreSQL allows using a binary representation of JSON as a column type - it is called jsonb and it is already supported by SQLAlchemy. Its query performance is much better than that of the plain json type.

http://www.postgresql.org/docs/9.4/static/functions-json.html
http://docs.sqlalchemy.org/en/rel_0_9/dialects/postgresql.html#sqlalchemy.dialects.postgresql.JSONB

Moreover, after indexing such structures with GIN indices, they work reasonably well. For example, let us consider a table author_jsonb:

id: integer
json: jsonb
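
In SQLAlchemy terms this table could be declared as follows (a minimal sketch; the class name is illustrative):

    from sqlalchemy import Column, Integer
    from sqlalchemy.dialects.postgresql import JSONB
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class AuthorJsonb(Base):
        __tablename__ = 'author_jsonb'
        id = Column(Integer, primary_key=True)
        json = Column(JSONB)  # binary JSON column, PostgreSQL >= 9.4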

And an example structure:

{
    "papers": [{"recid": 1}, {"recid": 2}],
    "canonical_name": "some_string"
}

Simple queries are very efficient!

select id from author_jsonb where json @> '{"canonical_name" : "some_string"}';
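
For this containment query to use an index, the whole column needs a GIN index. A sketch reusing the model above (the index name is illustrative; note that the jsonb_path_ops operator class is smaller and faster than the default jsonb_ops, but it accelerates only the @> operator):

    from sqlalchemy import Index

    Index('author_jsonb_json_idx', AuthorJsonb.__table__.c.json,
          postgresql_using='gin',
          postgresql_ops={'json': 'jsonb_path_ops'})
    # emitted DDL:
    #   CREATE INDEX author_jsonb_json_idx ON author_jsonb
    #   USING gin (json jsonb_path_ops)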

My benchmarks show that we might expect at most a ~10% performance drop on such queries compared to a relational database. However, the more complicated the queries become, the worse the expected performance gets (much worse than in a relational version of the database). For example, querying inside the papers array requires a dedicated index on the extracted path:

create index papers_index_b on author_jsonb using gin (((json->'papers')) jsonb_path_ops);
select id from author_jsonb where ((json->'papers')) @> '[{"recid" : 2}]';

Moreover, these indexes require querying for the whole key/value pair - for example, testing for the existence of a key with the ? operator is very expensive, no matter how we index the table.
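
For illustration, this is what such a key-existence test looks like through SQLAlchemy's JSONB comparator (reusing the model above; the DSN is a placeholder):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    engine = create_engine('postgresql://localhost/test')  # placeholder DSN
    session = sessionmaker(bind=engine)()

    # has_key() renders the jsonb ? operator:
    #   SELECT author_jsonb.id FROM author_jsonb
    #   WHERE author_jsonb.json ? %(json_1)s
    # a jsonb_path_ops GIN index cannot serve this predicate at all; only the
    # default jsonb_ops operator class indexes key existence
    ids = session.query(AuthorJsonb.id).filter(
        AuthorJsonb.json.has_key('canonical_name')).all()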

Fortunately, it seems that there are people working on improving this (see the materials attached to the talk page): http://www.pgcon.org/2014/schedule/events/696.en.html

They introduce a query language for the jsonb field called Jsquery. With it, my query might look like this:

select id from author_jsonb where json @@ 'papers @> {"recid": 2}';

This language is supposed to support key-existence searches, querying for nested values, and array operations better than the current jsonb operators do. Here is the link to the official development repo.

I believe this is a technology our project might want to use. We could also contribute to its development by reporting weak points (if we are able to find some). Obviously, it will take some time for the SQLAlchemy developers to expose this in their tool, so this is a glance into the future.

Storage engines

From the main page of our readthedocs:

Invenio requires a relational database backend to store information. MySQL or PostgreSQL (coming soon) are required for basic functionality, but there's also support for MongoDB or other experimental NoSQL solutions, including using SQLite for local development.

What are the current plans for Invenio? If we are going to support configurable storage engines, can you give me more details about what will be available, how it will be reflected in our database, etc.?

jirikuncar commented 10 years ago

@MSusik I'm working on PostgreSQL support (mainly fixing run_sql and friends). Later we want to look at the query object in the JSONAlchemy engines to support the native JSON field in PostgreSQL.

WIP in https://github.com/jirikuncar/invenio/tree/postgresql

tiborsimko commented 10 years ago

What are the current plans for Invenio?

Here is a brief outline of the status and plans in the data model and SQL/JSON departments.

Historically, Invenio supported records in the MARC metadata master format only. The format was stored in bibfmt, its individual exact values in the bibrec, bibrec_bibXXx, and bibXXx tables for easier browsing, and its processed tokenised terms in the idx* tables for easier searching. Moreover, Invenio cached an internal Pythonic representation of the master record metadata format, recstruct, which was basically a dict-of-tuples kind of structure representing MARC and enabling its faster processing. Additional beyond-metadata information, such as the number of citations of a paper or the number of borrowers of a book, was stored in other SQL tables and accessed in a special manner.

For the past couple of years, we've been proceeding with abstractions on several fronts. The abstraction of virtual fields made it possible to access beyond-metadata information, such as the number of citations, in the same manner as metadata fields. The abstraction of the master format permitted native treatment of richer-than-MARC formats (such as UNIMARC or EAD). This was done by switching the internal record format from dict-of-tuples to abstract JSON with models, the so-called recjson, as a more flexible replacement of recstruct. This also means that the bibXXx table storage technique became obsolete, and that we can take advantage of the various existing JSON storage possibilities.

A further abstraction was the multiplication of JSON stores, going beyond records: for example, the JSON representation of attached fulltext documents or of user-entered document annotations. These are basically other JSON structures, and one can reuse many things from recjson for them as well. Hence the later birth of the abstract JSONAlchemy, which makes it possible to create (and link) several JSON stores among themselves, as it were.

So, basically, we are switching the data model from the strict MARC model to an abstract JSON model with abstract storage layers behind it. We've tested both MongoDB and PostgreSQL storage layers and found that the latter is very often faster for our purposes. (See my blog post on this.) Hence PostgreSQL seems a very good compromise that can manage both SQL data and the new plethora of JSON data in the same package. Hence our plan to speed up the migration from MySQL to PostgreSQL (which has been in the plans for years already, together with SQLAlchemy) and which we would like to complete by the end of the year. (Jiri started some experiments that we plan to intensify further in autumn.)

For your concrete project at hand, you can consider JSONAlchemy as a generic JSON store offering alternative backends (MongoDB, PostgreSQL, etc.) and alternative IR engines (Solr, Elasticsearch) to manage and process any kind of JSON structure. We'd like to give users choice, and we'd like to be ready to take advantage of any new kid on the block that may come along in three years. Hence it would be good to think in terms of these abstractions, so that any JSON-related improvements are brought not only to records for some concrete use case, but rather to all the other JSON stores and use cases. There is a lot of work in this direction, so we can divide and conquer more rapidly.

Let's discuss this more IRL? Moving to PostgreSQL has been a recurrent topic in our weekly developer meetings on Mondays.

(P.S. I am not touching much here on the fact that the JSON store is still not self-sufficient: one still has to tokenise and index its values for non-exact matching via dedicated IR tools, be it the native Invenio indexer, Solr, Xapian, or Elasticsearch.)

tiborsimko commented 10 years ago

I'm working on PostgreSQL support (mainly fixing run_sql and friends).

It would be good to prepare a branch committable to the official next branch so that people can try out PostgreSQL easily even in its not-fully-functional state. E.g. we could introduce several files:

Something like this would permit developers to test the same topical branch (say on author disambiguation matters or on citation ranking matters) with the two DB backends. This should help speed up the conversion of legacy run_sql SQL statements into SQLAlchemy ORM instructions.
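
A hedged sketch of what such a conversion might look like (the run_sql call shows the legacy Invenio idiom; the Bibrec mapping below is a hypothetical stand-in for the real model):

    # before (legacy): raw SQL through run_sql, written against MySQL
    #   from invenio.legacy.dbquery import run_sql
    #   rows = run_sql("SELECT id FROM bibrec WHERE creation_date > %s", (date,))

    # after (portable): the SQLAlchemy ORM emits dialect-appropriate SQL
    from sqlalchemy import Column, DateTime, Integer
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Bibrec(Base):  # hypothetical mapping of the legacy bibrec table
        __tablename__ = 'bibrec'
        id = Column(Integer, primary_key=True)
        creation_date = Column(DateTime)

    def recent_record_ids(session, date):
        # the same query as above, now valid on MySQL and PostgreSQL alike
        return session.query(Bibrec.id).filter(Bibrec.creation_date > date).all()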

jirikuncar commented 8 years ago

Solved in SQLAlchemy 1.1.0b1 (see http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html?highlight=jsonb#sqlalchemy.dialects.postgresql.JSONB).
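
A minimal sketch of the resulting usage (model, session, and DSN are illustrative), mirroring the containment query from the top of the thread:

    from sqlalchemy import Column, Integer, create_engine
    from sqlalchemy.dialects.postgresql import JSONB
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class AuthorJsonb(Base):
        __tablename__ = 'author_jsonb'
        id = Column(Integer, primary_key=True)
        json = Column(JSONB)

    engine = create_engine('postgresql://localhost/test')  # placeholder DSN
    session = sessionmaker(bind=engine)()

    # contains() renders the jsonb containment operator @>:
    #   SELECT author_jsonb.id FROM author_jsonb
    #   WHERE author_jsonb.json @> %(json_1)s
    ids = session.query(AuthorJsonb.id).filter(
        AuthorJsonb.json.contains({'canonical_name': 'some_string'})).all()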