obiba / opal

OBiBa’s core database application for biobanks or epidemiological studies.
http://www.obiba.org/pages/products/opal/
GNU General Public License v3.0

Execute SQL queries on a project's tables #3616

Closed · ymarcon closed this issue 3 years ago

ymarcon commented 3 years ago

It would be useful to be able to run SQL queries on a datasource, for example:

select * from CNSIM1
select * from CNSIM1 union all select * from CNSIM2
select avg(LAB_HDL) as hdl_avg, GENDER as gender from CNSIM1 where LAB_HDL is not null group by GENDER
select avg(LAB_HDL) as hdl_avg, PM_BMI_CATEGORICAL as bmi from (select * from CNSIM1 union all select * from CNSIM2) where PM_BMI_CATEGORICAL is not null group by PM_BMI_CATEGORICAL

This would allow Opal to interface with systems that issue SQL queries, such as Redash.

ymarcon commented 3 years ago

REST entry point is per-project datasource:

POST /datasource/xxx/_sql[?id=_id]

where the body is the SQL query and the id parameter is the name of the identifiers column (default is _id).
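
As a minimal illustration of the call (a sketch only: the /ws prefix and the X-Opal-Auth personal access token header follow the conventions used later in this thread, and the server, project and query values are placeholders):

import requests

# Placeholders: adjust to your Opal server, project and personal access token.
OPAL_URL = "https://opal-demo.obiba.org"
PROJECT = "CNSIM"
TOKEN = "my-personal-access-token"

# The SQL query is posted as a plain text body; the optional id parameter
# would rename the identifiers column (default is _id).
query = "select avg(LAB_HDL) as hdl_avg, GENDER as gender from CNSIM1 where LAB_HDL is not null group by GENDER"

response = requests.post(
    "{0}/ws/datasource/{1}/_sql".format(OPAL_URL, PROJECT),
    headers={"X-Opal-Auth": TOKEN, "Content-Type": "text/plain"},
    data=query,
)
response.raise_for_status()
print(response.json())

Judging from the parsing done further down in this thread, the JSON response carries a columns array of column names and a rows array of row value arrays.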

slaverman commented 3 years ago

> It would be useful to be able to run SQL queries on a datasource, for example:
>
> select * from CNSIM1
> select * from CNSIM1 union all select * from CNSIM2
> select avg(LAB_HDL) as hdl_avg, GENDER as gender from CNSIM1 where LAB_HDL is not null group by GENDER
> select avg(LAB_HDL) as hdl_avg, PM_BMI_CATEGORICAL as bmi from (select * from CNSIM1 union all select * from CNSIM2) where PM_BMI_CATEGORICAL is not null group by PM_BMI_CATEGORICAL
>
> This would allow Opal to interface with systems that issue SQL queries, such as Redash.

We are currently trying to create a dashboard in Redash using the REST entry point described in this issue. However, it is not possible in Redash to post a plain-text body (only form-data or JSON), see https://redash.io/help/data-sources/querying/json-api and the screenshot below. In the post above you mentioned "This would allow Opal to interface with systems that issue SQL queries, such as Redash". How did you envision using Redash in combination with the SQL REST endpoint?

[screenshot: Redash JSON data source configuration, which only accepts form-data or JSON request bodies]

ymarcon commented 3 years ago

The idea is rather to develop a custom Redash Data Source. The JSON Data Source that you are proposing to use does not have the flexibility of SQL.

ymarcon commented 3 years ago

FYI you may be interested in the discussion thread "Creating a new query runner (data source) in Redash": https://discuss.redash.io/t/creating-a-new-query-runner-data-source-in-redash/347

slaverman commented 3 years ago

> FYI you may be interested in the discussion thread "Creating a new query runner (data source) in Redash": https://discuss.redash.io/t/creating-a-new-query-runner-data-source-in-redash/347

FYI: I wrote a super simple POC for an Opal QueryRunner, see the file below. You need to place this file into redash/query_runner/ and add the QueryRunner to redash/settings/__init__.py (a sketch of that registration follows the file). After rebuilding the server and worker Docker images, the Opal data source becomes visible in Redash. It is currently a POC; we are planning to add type guessing and error handling (see the TODOs). Maybe it would be nice to publish the python-opal-client to PyPI so that it can easily be used in the QueryRunner.

import logging
import sys
import uuid
import datetime

from redash.query_runner import *
from redash.utils import json_dumps, json_loads

logger = logging.getLogger(__name__)

try:
    import requests

    enabled = True
except ImportError:
    enabled = False

def parse_json(data):
    rows = []
    columns = []

    # TODO: implement TYPES_MAP (see json_ds.py)
    for column in data['columns']:
        columns.append({"name": column, "friendly_name": column, "type": "TYPE_STRING"})

    for row in data['rows']:
        rows.append(dict(zip(data['columns'], row)))

    return {"rows": rows, "columns": columns}

class Opal(BaseSQLQueryRunner):
    noop_query = "SELECT 1"

    @classmethod
    def configuration_schema(cls):
        return {
            "type": "object",
            "properties": {
                "server": {"type": "string", "default": "https://opal-demo.obiba.org"},
                "project": {"type": "string"},
                "token": {"type": "string"},
            },
            "required": ["server","project","token"],
            "secret": ["token"],
        }

    @classmethod
    def enabled(cls):
        return enabled

    @classmethod
    def name(cls):
        return "Opal"

    @classmethod
    def type(cls):
        return "opal"

    def run_query(self, query, user):
        # TODO: try, except, finally for error handling

        server = self.configuration.get("server", "")
        project = self.configuration.get("project", "")
        token = self.configuration.get("token", "")

        '''
            For testing the HTTP call in PowerShell:

            Invoke-WebRequest -Uri "ENDPOINT/ws/datasource/PROJECT_NAME/_sql" `
            -Method "POST" `
            -Headers @{
            "method"="POST"
            "X-Opal-Auth"="TOKEN"
            } `
            -ContentType "text/plain" `
            -Body "SELECT * FROM TABLE"
        '''

        #TODO: check for trailing / at the end of server URL
        url = "{0}/ws/datasource/{1}/_sql".format(server, project)

        headers = {
            'x-opal-auth': token,
            'Content-Type' : 'text/plain'
        }

        response = requests.request("POST", url, headers=headers, data=query)
        json_data = response.json()
        error = None

        '''
            Expected result/format:
            {
                'rows': [{'var_a': '-99', 'var_b': '123'}, {'var_a': '2', 'var_b': '12312'}], 
                'columns': [{'name': 'var_a', 'friendly_name': 'var_a', 'type': 'TYPE_STRING'}, {'name': 'var_b', 'friendly_name': 'var_b', 'type': 'TYPE_STRING'}]
            }
        '''
        data = json_dumps(parse_json(json_data))

        return data, error

    # TODO: implement get_schema (see snowflake.py)

register(Opal)
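
To enable the runner, it also has to be registered in Redash's settings. A sketch of that step, assuming the default_query_runners list that redash/settings/__init__.py holds in recent Redash versions (the exact list name may differ between versions):

# redash/settings/__init__.py (sketch; list name may vary across Redash versions)
default_query_runners = [
    # ... existing query runners ...
    "redash.query_runner.opal",  # the POC file above saved as redash/query_runner/opal.py
]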

ymarcon commented 3 years ago

Excellent! I completely agree with having a real pip library.

Type guessing can be done by getting the table dictionary at /datasource/{project}/table/{table}/variables.
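
Within the QueryRunner, such a lookup might be sketched as follows (an assumption-laden sketch: it presumes the variables endpoint returns a JSON array of variable descriptions carrying name and valueType fields, and it reuses the requests import and the Redash TYPE_* constants already available in the POC above):

# Sketch: map Opal value types to Redash column types via the table dictionary.
# Assumes each entry returned by the variables endpoint has "name" and "valueType".
OPAL_TYPES_MAP = {
    "integer": TYPE_INTEGER,
    "decimal": TYPE_FLOAT,
    "boolean": TYPE_BOOLEAN,
    "date": TYPE_DATE,
    "datetime": TYPE_DATETIME,
}

def guess_column_types(server, project, table, token):
    url = "{0}/ws/datasource/{1}/table/{2}/variables".format(server, project, table)
    response = requests.get(url, headers={"X-Opal-Auth": token, "Accept": "application/json"})
    response.raise_for_status()
    # Anything not in the map (e.g. text) falls back to TYPE_STRING.
    return {
        variable["name"]: OPAL_TYPES_MAP.get(variable.get("valueType"), TYPE_STRING)
        for variable in response.json()
    }

The resulting mapping could then replace the hard-coded TYPE_STRING fallback in parse_json.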

slaverman commented 3 years ago

> Excellent! I completely agree with having a real pip library.
>
> Type guessing can be done by getting the table dictionary at /datasource/{project}/table/{table}/variables.

It is a hassle to extract each table name from a query that joins many tables together and then extract the selected column names per table (especially when all columns are selected with *, when column names are renamed, or when a function is applied). Including the data types in the columns of the response would be a lot easier.

Btw, is it true that aliases for table names are not supported by the SQL API?

ymarcon commented 3 years ago

Right, without a proper SQL parser it is not possible to identify tables and variables. And anyway, the data dictionary is of no use when a SQL function transformation is applied. I also noticed that the JSON output was not correct regarding data types; see #3661, it is fixed.

Regarding table aliases, the table name detection is now more restrictive; see #3662.