rilldata / rill

Rill is a tool for effortlessly transforming data sets into powerful, opinionated dashboards using SQL. BI-as-code.
https://www.rilldata.com
Apache License 2.0

Runtime: Setup test fixtures for resolvers and migrate existing metrics tests #5218

Open begelundmuller opened 2 weeks ago

begelundmuller commented 2 weeks ago

The idea here is to set up a test file abstraction for runtime/resolvers, where test queries for any resolver, security claims, project files, and backend can be written in a nice syntax.

The idea is to draw inspiration from sqllogictest (also see this example from DuckDB's test files). For example, this could be YAML test files in runtime/resolvers/testdata, which are loaded, parsed, and executed.

It would also be nice to support a way to automatically update the expected test output, e.g. using an --update flag as described in this blog post.
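
As a sketch, the usual Go golden-file pattern for such an --update flag might look like this (the helper, flag, and package names here are illustrative, not part of the actual proposal):

package resolvers_test

import (
    "flag"
    "os"
    "testing"

    "github.com/stretchr/testify/require"
)

// Running `go test -update` rewrites the expected outputs in testdata
// instead of asserting against them.
var update = flag.Bool("update", false, "update expected test output in testdata")

func checkGolden(t *testing.T, path string, actual []byte) {
    if *update {
        require.NoError(t, os.WriteFile(path, actual, 0o644))
        return
    }
    expected, err := os.ReadFile(path)
    require.NoError(t, err)
    require.Equal(t, string(expected), string(actual))
}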

Lastly, we should start migrating our existing metrics tests to the new files.

begelundmuller commented 2 weeks ago

The idea is that each file could define a set of project files and OLAP connectors to target, and then a list of tests to run and expected results. We might then have many of these files, such as resolvers/testdata/metrics_unnest.yaml, resolvers/testdata/metrics_measure_filters.yaml, etc.

Some of the goals here are:

I'm sure this can be made much cleaner, but here's a draft YAML spec for a test file.

files:
  model.sql: SELECT ...
  metrics_view.yaml:
    type: metrics_view
    dimensions:
      - ...
    measures:
      - ...
connectors:
  - duckdb
  - clickhouse
tests:
  test_name:
    resolver: metrics
    properties:
      query:
        metrics_view: mv
        ...
    result:
      rows:
        - ...
      csv: >
        xxx,xxx,xxx
  other_test_name:
    variables:
      rill.metrics.approximate_comparisons: true
    claims:
      attributes:
      - ...
      rules:
      - ...
    resolver: metrics_sql
    properties:
      sql: SELECT ...
    result:
      rows:
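
For loading, each test file might map onto a couple of small Go structs, for example (a sketch; the field names are assumptions derived from the YAML draft above, not a final schema):

package resolvers_test

import "gopkg.in/yaml.v3"

// Hypothetical Go mapping of the draft spec above; the field names mirror
// the YAML sketch and are assumptions, not a final schema.
type TestFile struct {
    Files      map[string]yaml.Node `yaml:"files"`      // project files (strings or nested YAML) to materialize in a repo
    Connectors []string             `yaml:"connectors"` // OLAP connectors to run every test against
    Tests      map[string]*Test     `yaml:"tests"`
}

type Test struct {
    Resolver   string            `yaml:"resolver"`   // e.g. "metrics" or "metrics_sql"
    Properties map[string]any    `yaml:"properties"` // nested resolver properties, e.g. the query
    Variables  map[string]string `yaml:"variables"`  // instance variables scoped to this test
    Claims     map[string]any    `yaml:"claims"`     // security attributes and rules
    Result     struct {
        Rows []map[string]any `yaml:"rows"` // expected rows, or:
        CSV  string           `yaml:"csv"`  // expected output in CSV form
    } `yaml:"result"`
}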
egor-ryashin commented 2 weeks ago
(quoting the draft YAML spec from the previous comment)

Aside from the structure, it's not that easy compared with something as simple as what was proposed in the example:


require skip_reload

statement ok
ATTACH ':memory:' AS db1

statement ok
CREATE TABLE db1.integers(i INTEGER);

statement ok
INSERT INTO db1.integers VALUES (1), (2), (3), (NULL);

# now export the db
statement ok
EXPORT DATABASE db1 TO '__TEST_DIR__/export_test' (FORMAT CSV)

Writing SQL with YAML indentation will be awkward.

egor-ryashin commented 2 weeks ago

Being able to optionally test the export output instead of the query output

It should also be easy to run with the debugger (a frequent operation).

egor-ryashin commented 2 weeks ago

I should point out that this only looks so neat because it consists solely of SQL tests:


(the same DuckDB .test example as above)

and once one more parameter is added, it will look as complicated as a Go unit test. DuckDB needs to test a lot of SQL syntax and semantics, and that's why those .test files were created. But if you need to test something more structured that requires JSON or YAML configuration, that quickly goes sideways and starts to look like redundant work compared to using only Go unit tests. So the decision should be to restrict the test fixtures to as small a structure as possible, with no plan to extend it, knowing that the simple structure alone will cover a lot of test cases.

I should repeat that this is not simple, and I sense an expectation that it will be extended.

egor-ryashin commented 2 weeks ago

I propose anything that can be parsed line by line; for example, permissions can be added as an additional line:

statement ok
read-perm write-perm
CREATE TABLE db1.integers(i INTEGER);

statement ok
read-perm
INSERT INTO db1.integers VALUES (1), (2), (3), (NULL);

Other parameters should be excluded, and any resolver/project separation should be done by folders and filenames, or ignored and handled in specialized native Go tests.

begelundmuller commented 1 week ago

We are not looking to test SQL here; we are looking to test our resolvers, which a) rely on project files being parsed, b) need to be tested against multiple OLAPs (DuckDB, ClickHouse, Druid), and c) have nested query parameters.

I think the complexity here may make a flat text file tricky, but if you think it's possible, then please share a concrete proposal. If you think Go code is better, please suggest a syntax that incorporates all the considerations mentioned in the messages above.

egor-ryashin commented 1 week ago

So we have a directory structure right now like this:

testruntime
  adbids_druid (druid)
  adbids (duckdb)

the problem here is that each OLAP has its own directory and configuration, which means that if we create a resolver YAML inside one of them, we need to create a copy of it for the other OLAP, like:

testruntime
  adbids_druid
     apis
       check_sql1.yaml
       check_sql2.yaml
  adbids
     apis
       check_sql1.yaml
       check_sql2.yaml

Option 1: to reuse our current framework, we need to create for each OLAP identical test projects that differ only in OLAP configuration.
Option 2: generate or copy the apis YAML files when a test is run (as an apis YAML doesn't reference an OLAP type).

So in pseudocode there will be a resolvers_test.go with something like:

func TestResolvers(t *testing.T) {
    // The suffix "_resolver" marks projects created specifically for resolver tests.
    for _, dir := range dirs("testruntime/testdata/*_resolver") {
        fmt.Println("running fixtures for engine " + dir)
        runAllFixtures(t, dir)
    }
}

func runAllFixtures(t *testing.T, dir string) {
    cleanAPIFolder(dir)
    for _, fixture := range files("testruntime/testdata/resolver_fixtures/*.test") {
        // Read a fixture and generate a file in `apis` for the given instance project.
        createAPI(fixture, dir)
    }
    instance := createInstance(dir)
    runAPITests(t, instance)
}

Where the fixture for a SQL resolver should have a name sql_<name>.test and the content:

statement ok
read-perm write-perm
param1 bid param2 t1
SELECT param1 from param2;
—————
bid
1
2

or YAML syntax:

statement: ok
permissions:
  - read-perm
  - write-perm
params:
  param1: bid
  param2: t1
sql: SELECT param1 FROM param2;
output: |
  bid
  1
  2

Considering we need to generate APIs for arbitrary resolvers, it's now clear that YAML will be easier to use. runAPITests will then do something like:

for _, fn := range files("testruntime/testdata/resolver_fixtures/*.test") {
    t.Run("engine "+instance+" fixture "+fn, func(t *testing.T) {
        f := readFixture(fn)
        api := instance.APIForName(f.name)
        result := api.Resolve(convertToResolveArguments(f))
        require.Equal(t, output(f), toString(result))
    })
}

If we need to debug a particular fixture, we can write an ad-hoc testing function like:


func TestDebugFixture1(t *testing.T) {
    cleanAPIFolder(enginedir)
    createAPI(fixture_path, enginedir)
    instance := createInstance(enginedir)
    f := readFixture(fixture_path)
    api := instance.APIForName(f.name)
    result := api.Resolve(convertToResolveArguments(f))
    require.Equal(t, output(f), toString(result))
}
begelundmuller commented 1 week ago

Please take a deeper look at the first two comments on this issue. Notably, the goals are to:

  1. Not rely on project files in testruntime, but instead provide the project files directly in the test file for a group of tests. There are multiple reasons for this:
    • Makes it easier to customize project files for the targeted OLAP and specific test case
    • Makes it easy to add new test cases without breaking project files relied upon by tests in other files
    • Makes it easy to develop tests because the project files and test definition are next to each other
  2. Primarily to test the metrics: resolver that takes a Query (link). Unlike APIs and metrics SQL, this object can be quite nested.
  3. Be able to test complex edge cases, like resolvers with security claims, using project variables, etc.
egor-ryashin commented 1 week ago

So it looks like we merge testruntime/testdata/adbids/**/*.yaml into a single test.yaml (which additionally contains tests and connectors). It's one level more complex, because we need to parse the whole project differently and write another initialization flow for an instance, but I get your idea.

egor-ryashin commented 1 week ago

Considering this:

(the draft YAML spec quoted from above)

For each connector, we need to create an instance and give it (instead of the repo connector) a test-yaml connector that reads the project from the YAML.

    inst := &drivers.Instance{
        Environment:      "test",
        OLAPConnector:    "duckdb",
        RepoConnector:    "repo",
        CatalogConnector: "catalog",
        Connectors: []*runtimev1.Connector{
            {
                Type:   "file",
                Name:   "repo",
                Config: map[string]string{"dsn": projectPath},
            },
            {
                Type:   "duckdb",
                Name:   "duckdb",
                Config: map[string]string{"dsn": ":memory:"},
            },
            {
                Type: "sqlite",
                Name: "catalog",
                // Setting a test-specific name ensures a unique connection when "cache=shared" is enabled.
                // "cache=shared" is needed to prevent threading problems.
                Config: map[string]string{"dsn": fmt.Sprintf("file:%s?mode=memory&cache=shared", t.Name())},
            },
        },
        EmbedCatalog: true,
    }

resolvers_test.go then transforms into something like:

for _, suite := range files("testdata/*.suite") {
    for _, c := range readConnectors(suite) {
        // Create a 'test-yaml' repo connector that provides the project content from the `files` field.
        i := createInstance(c, createRepoConnector(suite))
        for _, test := range readTests(suite) {
            t.Run(test.name, func(t *testing.T) {
                runTest(i, test)
            })
        }
    }
}
egor-ryashin commented 1 week ago

Frankly, some points still trouble me.

Makes it easy to develop tests because the project files and test definition are next to each other

Considering this, I wonder why we designed the project configuration with separate files in the first place; maybe we should redesign the project configuration into a single file? Right now we are planning to create a different project configuration format for tests than for production, which can lead to additional complexity.

Makes it easy to add new test cases without breaking project files relied upon by tests in other files

This relates to the previous point. But I would add that the problem of suddenly-broken tests is solved by creating another project in testdata. You will encounter exactly the same problem with a single YAML file once it grows larger: you'll want to tweak a connector slightly for a new test, and suddenly some other tests in the file are broken. Besides, copy-pasting a file is as easy as copying a folder.

Makes it easy to develop tests because the project files and test definition are next to each other

Yes, but why were projects separated from tests and placed in testdata initially? We can place them together in the same folder.

begelundmuller commented 1 week ago

For each connector, we need to create an instance and give it (instead of the repo connector) a test-yaml connector that reads the project from the YAML.

Not necessarily – you can just use t.TempDir to create a temporary directory, and write the test files into it. That's what the parser tests do: https://github.com/rilldata/rill/blob/96bef3505aa43d040840ed57dfaea1a5c868beec/runtime/compilers/rillv1/parser_test.go#L1903
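
Concretely, a helper along these lines could write the inline files to a temp directory before creating the instance (a sketch of the approach the parser tests take; the helper name is made up):

package resolvers_test

import (
    "os"
    "path/filepath"
    "testing"

    "github.com/stretchr/testify/require"
)

// writeProjectFiles writes inline project files into a temp directory that
// can then back a regular "file" repo connector.
func writeProjectFiles(t *testing.T, files map[string]string) string {
    dir := t.TempDir()
    for name, contents := range files {
        p := filepath.Join(dir, name)
        require.NoError(t, os.MkdirAll(filepath.Dir(p), 0o755))
        require.NoError(t, os.WriteFile(p, []byte(contents), 0o644))
    }
    return dir
}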

I wonder why we designed the project configuration with separate files in the first place; maybe we should redesign the project configuration into a single file?

That would be nice and has been requested previously. A single file would be especially nice for small projects (like test projects!). But it's out of scope for this work.

Right now we are planning to create a different project configuration format for tests than for production

By taking the same approach as the parser tests, where files are declared inline but written out to a temp directory, this will not be a problem (because the actual parser will still be parsing multiple files on disk).

You will encounter exactly the same problem with a single YAML file once it grows larger: you'll want to tweak a connector slightly for a new test, and suddenly some other tests in the file are broken. Besides, copy-pasting a file is as easy as copying a folder.

So the goal here is exactly to be able to just copy-paste one file and change it. It means you can add a new test by adding a single file. Copy/pasting a folder in a separate location from the test means future maintainers need to open and diff many files to understand the difference. That's what would be nice to avoid.

Yes, but why was the separation of projects from tests and placing them in testdata initially? We can place them together in the same folder.

Yeah the goal here is to correct that mistake by moving test data and tests closer to each other.

egor-ryashin commented 1 week ago

FYI, the datasource will be a separate file anyway.

egor-ryashin commented 1 week ago

An example of the project in a single yaml:

project:
  sources:
    ad_bids_mini_source.yaml:
      connector: local_file
      path: data/AdBids.csv.gz
  models:
    ad_bids_mini.sql: |
      select
        id,
        timestamp,
        publisher,
        domain,
        volume,
        impressions,
        clicks
      from ad_bids_mini_source
  dashboards:
    ad_bids_mini_metrics_with_policy.yaml:
      model: ad_bids_mini
      display_name: Ad bids
      description:

      timeseries: timestamp
      smallest_time_grain: ""

      dimensions:
        - label: Publisher
          name: publisher
          expression: upper(publisher)
          description: ""
        - label: Domain
          property: domain
          description: ""

      measures:
        - label: "Number of bids"
          name: bid's number
          expression: count(*)
        - label: "Total volume"
          name: total volume
          expression: sum(volume)
        - label: "Total impressions"
          name: total impressions
          expression: sum(impressions)
        - label: "Total clicks"
          name: total click"s
          expression: sum(clicks)

      security:
        access: true
        row_filter: "domain = '{{ .user.domain }}'"
        exclude:
          - if: "'{{ .user.domain }}' != 'msn.com'"
            names:
              - total volume

  apis:
    mv_sql_policy_api.yaml:
      kind: api
      metrics_sql: |
        select
          publisher,
          domain,
          "total impressions",
          "total volume"
        FROM
          ad_bids_mini_metrics_with_policy
connectors:
  - duckdb
  - clickhouse
tests:
  test_name:
    resolver: metrics
    properties:
      query:
        metrics_view: mv
        ...
    result:
      rows:
        - ...
      csv: >
        xxx,xxx,xxx
  other_test_name:
    variables:
      rill.metrics.approximate_comparisons: true
    claims:
      attributes:
      - ...
      rules:
      - ...
    resolver: metrics_sql
    properties:
      sql: SELECT ...
    result:
      rows:
begelundmuller commented 1 week ago

An example of the project in a single yaml:

Looks pretty good to me.

project:
  sources:
    ad_bids_mini_source.yaml:

Could this be flattened to something like this?

project:
  sources/ad_bids_mini_source.yaml:
egor-ryashin commented 1 week ago

https://github.com/rilldata/rill/pull/5317