sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0
59 stars 16 forks source link

[Feature] Add support for boolean columns #157

Open HansBambel opened 2 years ago

HansBambel commented 2 years ago

Is your feature request related to a problem? Please describe. Currently boolean columns are unsupported. I was not able to find a reason for this in the docs. Is there a particular reason why this is not supported?

Describe the solution you'd like Have metrics that are able to use boolean columns. Right now I am using a custom sql_metric such as:

SUM(CASE WHEN is_clean = true THEN 1 ELSE 0 END) as clean_amount,
SUM(CASE WHEN is_clean = true THEN 1 ELSE 0 END)/COUNT(*)::float as clean_percent

Soda SQL Version: 2.1.0b18

vijaykiran commented 2 years ago

Hi Hans,

I can't see what kind of metrics we can have on boolean columns. One thing I can think of is count of true/false. Are there any other aggregations you can think of here?

HansBambel commented 2 years ago

Yes, I could think of the count as you mentioned and also the percentage of trueness.

For example we are having a column for a cleanly parsed document called is_clean. The metric for us would be that this column clean_percent is always above a certain threshold:

tests:
    - clean_percent > 80
vijaykiran commented 2 years ago

Okay. Got it. I think count and percentage go hand in hand.

mistynodine commented 2 years ago

This is a good idea, I am currently using custom metrics here as well. There is the additional issue that sometimes we see different representations for True and False. ('T', 'true', 'True', 1, '1', etc). Also with respect to the percentages, one thing to think about is how you factor in the entries that are invalid or null. So if you have 10 values, 2 are invalid, 4 are true and 4 are false, is that 40% true? Or is it the percent of the valid values which is 50%?