sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0
60 stars 15 forks source link

Add support for `include` and `read` functions #248

Open dirkgroenen opened 3 years ago

dirkgroenen commented 3 years ago

As discussed earlier last weekend's code restructure should allow for more freedom when defining a soda-sql project directory structure. To accomplish this we no longer make 'directory' assumptions and instead allow developers to inline their sql_metrics in the scan-yml file.

In order to prevent scan-yml files from getting too large we should also support 'macro functions' which allow developers to include multiple files or read the contents of a file. As proposed we should consider adding support for either one, or both, of the following functions:

An example of a scan-YML file which uses read() could then look like:

// ./tables/orders.yml
table_name: CUSTOMERS
metrics:
  - row_count
sql_metrics:
  # Use inline custom metrics
  - sql: |
      SELECT
        COUNT(CASE WHEN O_COUNTRY IS "NL" THEN 1 END) as dutch_customers,
      FROM CUSTOMERS
    tests:
      no_dutch_customers: dutch_customers == 0
  # Use inline custom metrics where the SQL is defined in a separate file
  - sql: read(./sql/very_complex_sql.sql)
    tests:
      order_size: order_size > 1000
tests:
  dataset_size: row_count == 0

An example using include() would look like:

// ./tables/orders.yml
table_name: ORDERS
metrics:
  - row_count
  - sum
columns:
  O_ORDERSTATUS:
    valid_values:
      - O
      - F
sql_metrics: include(./metrics/orders/*.yml)
tests:
  dataset_size: row_count == 0
// ./metrics/orders/my_metric.yml
sql: |
  SELECT
    COUNT(CASE WHEN O_ORDERSTATUS IS "P" THEN 1 END) as order_p,
    COUNT(CASE WHEN O_ORDERSTATUS IS "O" THEN 1 END) as order_o,
  FROM ORDERS
tests:
  open_orders: order_o == 0
  processing_orders: order_p > 0

Up for discussion:

My two cents: merging into one might create more automagical behavior and can therefore be less preferred.

tombaeyens commented 3 years ago

Reflecting on this, my current thought is that we should go for a strategy to strictly separate the interpretation of the keys. I think that will lead to better error messages and simpler documentation.

sql_metrics -> always expect a list of dicts sql_metrics_includes -> expects a list of file paths (with optional wild cards)

Similarly sql_metrics.sql_file -> file path

This would imply we may have to revisit the named tests as we currently also allow for 2 flavours on 1 key tests. We could change that to tests -> list of unnamed test expressions named_tests -> dictionary of tests

Other thoughts?

dirkgroenen commented 3 years ago

I remember your suggested flavor to have been discussed in the engineering meeting leading up to the proposal as in this issue's OP, but there were some motivations from the team against it which were mostly related to verbosity, maintenance and reduced flexibility.

Some arguments for using include/read functions which I can remember:

I can't remember all arguments from the discussion, so don't pin me on it.

My personal vote goes for the "function" approach given it's a universal solution which only has to be documented and explained once, where as the verbose xxx_list, xxx_file, xxx_include requires (imo) more documentation.

tombaeyens commented 3 years ago

So for sql_metrics you reproposing to go for offering different options like this?

sql_metrics: include(./path/*.yml)
sql_metrics: 
    - include(./path/*.yml)
    - read(./anotherpath/someother.yml)

and

sql_metrics: 
    - sql: ...
      tests: ...
    - sql: ...
      tests: ...

Where the last 2 could actually be combined.

tombaeyens commented 3 years ago

And also

sql_metrics: 
    - sql: ...
      tests: ...
    - sql: read(./anotherpath/someother.sql)
      tests: ...
dirkgroenen commented 3 years ago

I personally do like that flavor, yes

dirkgroenen commented 3 years ago

Note how read could also be used to resolve https://github.com/sodadata/soda-sql/issues/166.

tombaeyens commented 3 years ago

We should make sure sodadata/soda-sql#267 this is fixed as we implement this feature ([bug] sql_file is recongnized as valid sql_metric key sodadata/soda-sql#267)