sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0

No error is thrown when including tests for two tables in one scan.yml file #175

Open bjornvandijkman-ingka opened 2 years ago

bjornvandijkman-ingka commented 2 years ago

For me it seemed intuitive to include tests for multiple tables in one scan.yml file as follows:

table_name: orders
metrics:
  - distinct
samples:
  table_limit: 50
columns:
  id:
    valid_format: uuid
    tests:
      - distinct == 99
  status:
    tests:
      - distinct == 5 

table_name: customers
metrics:
  - missing_count
  - missing_percentage
  - min_length
  - distinct
samples:
  table_limit: 50
metric_groups:
  - profiling
  - duplicates
columns:
  id:
    valid_format: uuid
    tests:
      - invalid_percentage == 0
      - missing_count == 0
      - distinct == 100
  first_name:
    tests:
      - min_length > 1
  last_name:
    tests:
      - min_length > 1
      - invalid_count == 0

Doing this results in only the customers table being scanned, while the orders table is silently ignored. I think it would be nice to have support for testing multiple tables in one file, but until such functionality is implemented it would be more user-friendly to raise a warning or error stating that soda currently cannot handle multiple tables in one file.

anilkulkarni87 commented 2 years ago

This is because of PyYAML's default behaviour: when a key is repeated in a mapping, the later value silently overwrites the earlier one. It can be resolved by writing a custom loader. Here is an example:

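import yaml

# Special loader with duplicate key checking
class UniqueKeyLoader(yaml.SafeLoader):
    def construct_mapping(self, node, deep=False):
        seen_keys = []  # keys already encountered in this mapping
        for key_node, value_node in node.value:
            key = self.construct_object(key_node, deep=deep)
            # Raise explicitly: a bare assert is skipped when Python runs with -O
            if key in seen_keys:
                raise AssertionError(f"Duplicate key in Yaml File: {key}")
            seen_keys.append(key)
        # No duplicates found, so build the mapping as usual
        return super().construct_mapping(node, deep)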

And then we can use this by calling:

    filename = 'soda.yml'
    # Use a context manager so the file is closed after parsing
    with open(filename, 'r') as f:
        data = yaml.load(f, Loader=UniqueKeyLoader)

The error looks like this: AssertionError: Duplicate key in Yaml File: table_name

The error message can be customized as needed.
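
As a rough sketch of one such customization (not what soda-sql does), the loader could raise PyYAML's own ConstructorError with the position of the repeated key, so the message shows the offending line. The class name StrictKeyLoader and the inline SCAN_YML sample are made up for illustration. The snippet also shows the default behaviour for comparison:

```python
import yaml
from yaml.constructor import ConstructorError


class StrictKeyLoader(yaml.SafeLoader):
    """SafeLoader variant that rejects duplicate mapping keys (hypothetical name)."""

    def construct_mapping(self, node, deep=False):
        seen = []
        for key_node, _ in node.value:
            key = self.construct_object(key_node, deep=deep)
            if key in seen:
                # Attach the marks so the message reports where the key is repeated.
                raise ConstructorError(
                    "while constructing a mapping", node.start_mark,
                    f"found duplicate key {key!r}", key_node.start_mark,
                )
            seen.append(key)
        return super().construct_mapping(node, deep)


# Shortened, illustrative scan file with two table_name blocks.
SCAN_YML = """\
table_name: orders
metrics:
  - distinct
table_name: customers
metrics:
  - missing_count
"""

# Default behaviour: the later value silently wins.
default_result = yaml.safe_load(SCAN_YML)
print(default_result["table_name"])

# Strict behaviour: the duplicate is reported instead of being dropped.
try:
    yaml.load(SCAN_YML, Loader=StrictKeyLoader)
    error_message = None
except ConstructorError as exc:
    error_message = str(exc)
print(error_message)
```

Raising a YAML error rather than an AssertionError means callers that already catch yaml.YAMLError would handle it without changes, though whether that fits the parser in soda-sql is an open question.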

vijaykiran commented 2 years ago

@anilkulkarni87 that's a very nice approach. Would you like to open a PR with this?

anilkulkarni87 commented 2 years ago

@vijaykiran Yes I will work on it.

anilkulkarni87 commented 2 years ago

@vijaykiran I have created a draft pull request: https://github.com/sodadata/soda-sql/pull/624 I do have some questions, though. Please take a look.

anilkulkarni87 commented 2 years ago

I will also have to make a change at: https://github.com/sodadata/soda-sql/blob/8a75b53902615d2724ed17c6560d4ec936dc449a/core/sodasql/scan/parser.py#L75