samply / blaze

A FHIR® Server with internal, fast CQL Evaluation Engine
https://samply.github.io/blaze
Apache License 2.0
156 stars 18 forks source link

Implement a Blaze-PostgreSQL adapter #1543

Open davidmscholz opened 8 months ago

davidmscholz commented 8 months ago

As stated in the documentation (https://github.com/samply/blaze/blob/master/docs/architecture.md), it should be possible to use PostgreSQL as the Resource Store for Blaze.

I think having that option would be very beneficial to us if it would enable users to perform SQL queries on the resources in order to more easily assess data quality in their Blaze instance.

On the other hand, using PostgreSQL as the Resource Store might come with decreased CQL-query performance. I imagine the performance hit wouldn't be too large for us since the datasets we (CCP) use are typically not very large (< 1GB) and are not expected to grow very rapidly.

Could we implement a Blaze-PostgreSQL adapter and compare the CQL-query performances on different datasets between two instances of Blaze? One instance would use RocksDB as the Resource Store as a gold standard, and the other would use PostgreSQL as the Resource Store.

alexanderkiel commented 8 months ago

Hi David,

it is perfectly possible to implement a Postgres resource store adapter. Blaze is designed to support more than one implementation. For example, the distributed storage variant uses Cassandra as resource store.

However I would not use the resource store for queries without the index. Let me explain that. In the resource store, blaze stores versions of resources by their content hash. That means that a resource with an update appears twice in the resource store and there is no way to tell which version is the current one. That information is only available in the index.

For data quality queries, I would suggest to use CQL directly. I have some examples in the CQL Documentation. More in depth data quality queries should be possible with CQL.

Another way, I could imagine is to implement SQL on FHIR in Blaze.

So at the end, I don't see a performance problem using Postgres as resource store, but the way Blaze stores resources in the resource store is not applicable for queries.

davidmscholz commented 8 months ago

Hi Alex, I didn't know about SQL on FHIR. That sounds definitely useful! However, would that mean that we would need to duplicate the Blaze contents in order to be able to analyze them using standard SQL tools?

If so wouldn't it be possible to "just" store some version-counter or timestamp together with the hash and FHIR resource upon insert when using Postgres as Resource Store?

alexanderkiel commented 8 months ago

SQL on FHIR should be implemented on top of the Blaze storage architecture, the same as I did with CQL. So no duplicate storage.

Why isn't CQL an option for the data quality queries?

davidmscholz commented 8 months ago

Very nice. SQL is just what people are used to and the official CQL docs aren't great to be honest.

alexanderkiel commented 8 months ago

Yes ok, got the point with SQL. But implementing SQL on FHIR will not go fast, even if you like to do it. So can I help you with solving your data quality queries with CQL?

davidmscholz commented 8 months ago

Yes, I can imagine that implementing SQL on FHIR will not be accomplished overnight. Thank you for offering to help with the CQL queries. I have built a workaround solution for the time being. That was mainly motivated by other limitations that have nothing to do with Blaze. My main goal is to make the raw data more accessible to our users (bridgehead responsibles) so that they can more easily perform exploratory analyses themselves.

alexanderkiel commented 8 months ago

I would try to tech people CQL. We already did one such event at the TMF last autumn and we plan a new event for this year. So I would be interested in examples of data quality queries we can use for the lectures. In my opinion CQL has advantages over SQL for FHIR. It's also available in HAPI and Firefly Server. So I would suggest not to ignore it.

davidmscholz commented 8 months ago

I'd be interested in such a CQL event. Especially if it was possible to participate remotely since I currently don't have much time for travel. I will try to write down the most important questions and examples and send them to you.