target / data-validator

A tool to validate data, built around Apache Spark.
Other
101 stars 34 forks source link

ColumnSumCheck should treat an unexpected type as a test failure, not exception #46

Open colindean opened 4 years ago

colindean commented 4 years ago

@samratmitra-0812 pointed out:

This behaviour of throwing an exception for unsupported type [in columnSumCheck] is different from columnMaxCheck, where it is treated as a normal check failure. I think we should make both of them consistent.

phpisciuneri commented 4 years ago

@colindean @samratmitra-0812 I think handling as an exception in ColumnMaxCheck is more appropriate. In both cases it is an edge that isn't covered, rather than a logical failure of the test. What do you think?

colindean commented 4 years ago

I can see it either way. Think of the failure mode when DV encounters an unsupported column type:

  1. Log and fail the individual check
  2. Explode

DV expects the user to inspect failures in the report at the end of a run. Should DV expect the user to know the type of the data in a column, especially when DV's config cannot necessarily reflect that type? That is, YAML only knows JSON numeric types, which are integer and double (I think they're called "fractions" technically in JSON) and exponential.

Also, is it more user-friendly to allow a validation run to complete or should it die immediately at runtime?

Immediate exit for normally long-running processes should be focused at the start of the run. If there's a way we can detect supported types in configCheck, we should do that if DV will exit ungracefully when it encounters an unhandled case.

One of the things I liked about my expressions approach to columnSumCheck was pushing the type handling to the Spark/Hive level, so we didn't need to maintain a list of supported data types that generally ends up being an exhaustive list of subtypes of NumericType without a way to ensure that we've handled all cases exhaustively.

phpisciuneri commented 4 years ago

Immediate exit for normally long-running processes should be focused at the start of the run. If there's a way we can detect supported types in configCheck, we should do that if DV will exit ungracefully when it encounters an unhandled case.

@colindean This is a good point. Maybe too general of a statement, but at level of config check we can throw exceptions, it will be a fast failure to your point. And at deeper levels (like quick check) we just log the error so that other checks still have the chance to complete.

samratmitra-0812 commented 4 years ago

@colindean @phpisciuneri Datatype verification in configCheck is being done in NegativeCheck and StringRegexCheck. But even there, it is being treated as a normal check failure.