great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Azure SQL Table Asset throwing “NoneType object is not iterable” while validating in azure databricks - gx 1.0.0 #10305

Closed DineshBaratam-5 closed 1 month ago

DineshBaratam-5 commented 1 month ago

Describe the bug
After creating a data context using a DBFS path in Azure Databricks, I connected to an Azure SQL Server data source and added a table asset. I then created an expectation suite and tried to validate the expectations against the table asset. The validator throws the error "NoneType object is not iterable". I tried the same thing in a local environment instead of Databricks and got the same issue. Please note that the issue occurs when using a SQL data asset to connect to Azure SQL Server, not when connecting to an Azure Storage account.

To Reproduce
Below is the code snippet I executed in the Azure Databricks cloud environment:

```python
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
import pandas as pd
import sqlalchemy as sa
import sqlalchemy.engine as sae

context_root_dir = "/dbfs/great_expectations/"
context = gx.get_context(mode="file", project_root_dir=context_root_dir)

connection_url = sae.URL.create(
    "mssql+pyodbc",
    username=targetDbUserName,
    password=targetDbPassword,
    host=targetSQLServerName,
    database=targetDataBaseName,
    query={"driver": "ODBC Driver 18 for SQL Server"},
)
connectionString = connection_url.render_as_string(hide_password=False)

datasource = context.data_sources.add_or_update_sql(
    name="TargetSQLDataSource", connection_string=connectionString
)
asset_name = "ProductAsset"
dataasset = datasource.add_table_asset(name=asset_name, table_name=targetTableName)
batch_request = dataasset.build_batch_request()

expectation_suite_name = "TargetSQL_expectation_suite"
suite = gx.ExpectationSuite(expectation_suite_name)
try:
    context.suites.add(suite)
except gx.exceptions.DataContextError as DCE:
    context.suites.delete(expectation_suite_name)
    context.suites.add(suite)

validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name=expectation_suite_name
)

validator.expect_column_values_to_not_be_null(column="ProductID")
validator.expect_column_values_to_be_between(column="StandardCost", min_value=0, max_value=100000)
validator.save_expectation_suite(discard_failed_expectations=False)
```

After executing the above code, everything works fine up to the creation of the validator; the error is thrown at the line `validator.expect_column_values_to_not_be_null(column="ProductID")`. Below is the full error trace.

```
MetricResolutionError: 'NoneType' object is not iterable

TypeError                                 Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/execution_engine/execution_engine.py:533, in ExecutionEngine._process_direct_and_bundled_metric_computation_configurations(self, metric_fn_direct_configurations, metric_fn_bundle_configurations)
    531 try:
    532     resolved_metrics[metric_computation_configuration.metric_configuration.id] = (
--> 533         metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable
    534             **metric_computation_configuration.metric_provider_kwargs
    535         )
    536     )
    537 except Exception as e:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/expectations/metrics/metric_provider.py:60, in metric_value.<locals>.wrapper.<locals>.inner_func(*args, **kwargs)
     58 @wraps(metric_fn)
     59 def inner_func(*args: P.args, **kwargs: P.kwargs):
---> 60     return metric_fn(*args, **kwargs)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/expectations/metrics/table_metrics/table_columns.py:49, in TableColumns._sqlalchemy(cls, execution_engine, metric_domain_kwargs, metric_value_kwargs, metrics, runtime_configuration)
     48 column_metadata = metrics["table.column_types"]
---> 49 return [col["name"] for col in column_metadata]

TypeError: 'NoneType' object is not iterable

The above exception was the direct cause of the following exception:

MetricResolutionError                     Traceback (most recent call last)
File , line 1
----> 1 validator.expect_column_values_to_not_be_null(column="ProductID")
      2 # validator.expect_column_values_to_be_between(column="StandardCost", min_value=0, max_value=100000)
      3 validator.save_expectation_suite(discard_failed_expectations=False)

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validator.py:547, in Validator.validate_expectation.<locals>.inst_expectation(*args, **kwargs)
    541     validation_result = ExpectationValidationResult(
    542         success=False,
    543         exception_info=exception_info,
    544         expectation_config=configuration,
    545     )
    546 else:
--> 547     raise err  # noqa: TRY201
    549 if self._include_rendered_content:
    550     validation_result.render()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validator.py:505, in Validator.validate_expectation.<locals>.inst_expectation(*args, **kwargs)
    501     validation_result = ExpectationValidationResult(
    502         expectation_config=copy.deepcopy(configuration)
    503     )
    504 else:
--> 505     validation_result = expectation.validate(
    506         validator=self,
    507         suite_parameters=self._expectation_suite.suite_parameters,
    508         data_context=self._data_context,
    509         runtime_configuration=basic_runtime_configuration,
    510     )
    512 # If validate has set active_validation to true, then we do not save the config to avoid
    513 # saving updating expectation configs to the same suite during validation runs
    514 if self._active_validation is True:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/expectations/expectation.py:1236, in Expectation.validate(self, validator, suite_parameters, interactive_evaluation, data_context, runtime_configuration)
   1230 self._warn_if_result_format_config_in_expectation_configuration(configuration=configuration)
   1232 configuration.process_suite_parameters(
   1233     suite_parameters, interactive_evaluation, data_context
   1234 )
   1235 expectation_validation_result_list: list[ExpectationValidationResult] = (
-> 1236     validator.graph_validate(
   1237         configurations=[configuration],
   1238         runtime_configuration=runtime_configuration,
   1239     )
   1240 )
   1241 return expectation_validation_result_list[0]

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validator.py:640, in Validator.graph_validate(self, configurations, runtime_configuration)
    638         return evrs
    639     else:
--> 640         raise err  # noqa: TRY201
    642 configuration: ExpectationConfiguration
    643 result: ExpectationValidationResult

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validator.py:619, in Validator.graph_validate(self, configurations, runtime_configuration)
    612 resolved_metrics: _MetricsDict
    614 try:
    615     (
    616         resolved_metrics,
    617         evrs,
    618         processed_configurations,
--> 619     ) = self._resolve_suite_level_graph_and_process_metric_evaluation_errors(
    620         graph=graph,
    621         runtime_configuration=runtime_configuration,
    622         expectation_validation_graphs=expectation_validation_graphs,
    623         evrs=evrs,
    624         processed_configurations=processed_configurations,
    625         show_progress_bars=self._determine_progress_bars(),
    626     )
    627 except Exception as err:
    628     # If a general Exception occurs during the execution of "ValidationGraph.resolve()", then
    629     # all expectations in the suite are impacted, because it is impossible to attribute the failure to a metric.
    630     if catch_exceptions:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validator.py:775, in Validator._resolve_suite_level_graph_and_process_metric_evaluation_errors(self, graph, runtime_configuration, expectation_validation_graphs, evrs, processed_configurations, show_progress_bars)
    770 resolved_metrics: _MetricsDict
    771 aborted_metrics_info: _AbortedMetricsInfoDict
    772 (
    773     resolved_metrics,
    774     aborted_metrics_info,
--> 775 ) = self._metrics_calculator.resolve_validation_graph(
    776     graph=graph,
    777     runtime_configuration=runtime_configuration,
    778     min_graph_edges_pbar_enable=0,
    779 )
    781 # Trace MetricResolutionError occurrences to expectations relying on corresponding malfunctioning metrics.
    782 rejected_configurations: List[ExpectationConfiguration] = []

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/metrics_calculator.py:265, in MetricsCalculator.resolve_validation_graph(self, graph, runtime_configuration, min_graph_edges_pbar_enable)
    263 resolved_metrics: _MetricsDict
    264 aborted_metrics_info: _AbortedMetricsInfoDict
--> 265 resolved_metrics, aborted_metrics_info = graph.resolve(
    266     runtime_configuration=runtime_configuration,
    267     min_graph_edges_pbar_enable=min_graph_edges_pbar_enable,
    268     show_progress_bars=self._show_progress_bars,
    269 )
    270 return resolved_metrics, aborted_metrics_info

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validation_graph.py:205, in ValidationGraph.resolve(self, runtime_configuration, min_graph_edges_pbar_enable, show_progress_bars)
    202 resolved_metrics: Dict[_MetricKey, MetricValue] = {}
    204 # updates graph with aborted metrics
--> 205 aborted_metrics_info: _AbortedMetricsInfoDict = self._resolve(
    206     metrics=resolved_metrics,
    207     runtime_configuration=runtime_configuration,
    208     min_graph_edges_pbar_enable=min_graph_edges_pbar_enable,
    209     show_progress_bars=show_progress_bars,
    210 )
    212 return resolved_metrics, aborted_metrics_info

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validation_graph.py:305, in ValidationGraph._resolve(self, metrics, runtime_configuration, min_graph_edges_pbar_enable, show_progress_bars)
    302     failed_metric_info[failed_metric.id]["exception_info"] = exception_info
    304 else:
--> 305     raise err  # noqa: TRY201
    306 except Exception as e:
    307     if catch_exceptions:

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/validator/validation_graph.py:276, in ValidationGraph._resolve(self, metrics, runtime_configuration, min_graph_edges_pbar_enable, show_progress_bars)
    271     computable_metrics.add(metric)
    273 try:
    274     # Access "ExecutionEngine.resolve_metrics()" method, to resolve missing "MetricConfiguration" objects.
    275     metrics.update(
--> 276         self._execution_engine.resolve_metrics(
    277             metrics_to_resolve=computable_metrics,  # type: ignore[arg-type] # Metric typing needs further refinement.
    278             metrics=metrics,  # type: ignore[arg-type] # Metric typing needs further refinement.
    279             runtime_configuration=runtime_configuration,
    280         )
    281     )
    282     progress_bar.update(len(computable_metrics))
    283     progress_bar.refresh()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/execution_engine/execution_engine.py:279, in ExecutionEngine.resolve_metrics(self, metrics_to_resolve, metrics, runtime_configuration)
    270 metric_fn_bundle_configurations: List[MetricComputationConfiguration]
    271 (
    272     metric_fn_direct_configurations,
    273     metric_fn_bundle_configurations,
   (...)
    277     runtime_configuration=runtime_configuration,
    278 )
--> 279 return self._process_direct_and_bundled_metric_computation_configurations(
    280     metric_fn_direct_configurations=metric_fn_direct_configurations,
    281     metric_fn_bundle_configurations=metric_fn_bundle_configurations,
    282 )

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/great_expectations/execution_engine/execution_engine.py:538, in ExecutionEngine._process_direct_and_bundled_metric_computation_configurations(self, metric_fn_direct_configurations, metric_fn_bundle_configurations)
    532     resolved_metrics[metric_computation_configuration.metric_configuration.id] = (
    533         metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable
    534             **metric_computation_configuration.metric_provider_kwargs
    535         )
    536     )
    537 except Exception as e:
--> 538     raise gx_exceptions.MetricResolutionError(
    539         message=str(e),
    540         failed_metrics=(metric_computation_configuration.metric_configuration,),
    541     ) from e
    543 try:
    544     # an engine-specific way of computing metrics together
    545     resolved_metric_bundle: Dict[Tuple[str, str, str], MetricValue] = (
    546         self.resolve_metric_bundle(metric_fn_bundle=metric_fn_bundle_configurations)
    547     )

MetricResolutionError: 'NoneType' object is not iterable
```

Expected behavior
The GX workflow should execute, and the validator should save the expectation suite after validating the given expectations.

Environment (please complete the following information):

adeola-ak commented 1 month ago

Hello @DineshBaratam-5, I'm looking into this issue now.

adeola-ak commented 1 month ago

@DineshBaratam-5 have you followed our documentation for 1.0? 1.0 introduced some breaking changes, largely around simplifying the interface, so I'd recommend reviewing the sample code, as I believe that is the cause of your errors. For example:

```python
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)
```

is how you would create the data source now.

I will close this issue for now; can you let me know if you are still having this issue after updating the necessary methods? Here is a working example (local, since you said you also had the problem locally):

```python
context = gx.get_context(mode="file")

connection_string = "${AZURE_STORAGE_CONNECTION_STRING}"

data_source_name = "azure_data_source"
azure_options = {
    "conn_str": connection_string
}

# Create and get Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)
data_source = context.get_datasource(data_source_name)

# Define Data Asset's parameters.
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"

file_asset = data_source.add_csv_asset(
    name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)
file_asset = context.data_sources.get(data_source_name).get_asset(asset_name)

batch_definition_name = "yellow tripdata sample"
batch_definition_path = "yellow_tripdata_sample_2019-01.csv"

batch_definition = file_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)

batch = batch_definition.get_batch()
print(batch.head())
```

gundpm commented 1 month ago

Hi @adeola-ak, thank you for your response. I am still facing this problem in GX 1.0.1, even though I have followed the new documentation style. The problem occurs with the MSSQL connection, not the Azure Storage account connection.

DineshBaratam-5 commented 1 month ago

> @DineshBaratam-5 have you followed our documentation for 1.0? 1.0 introduced some breaking changes, largely around simplifying the interface, so I'd recommend reviewing the sample code as I believe that is the cause of your errors. [...]

Hi @adeola-ak, I am not working with an Azure Storage account; I am working with Azure SQL Server. That is why I am not using the file-system approach and am instead using a SQL data source with the necessary SQL connection string. Please reopen the issue. Please refer to this link from the GX Core documentation for the references I used for my code: https://docs.greatexpectations.io/docs/core/connect_to_data/sql_data/?procedure=sample_code

adeola-ak commented 1 month ago

Can you verify that you have a successful connection to the data? The validator cannot run without connecting to data, and your previous code references methods that no longer exist in 1.0. Without updating the methods and ensuring a successful connection, you will not be able to run the validator.
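The traceback above fails in `TableColumns._sqlalchemy`, which iterates over the resolved `table.column_types` metric, so that metric is coming back as `None` for your table. As a quick check outside GX (a hypothetical diagnostic, reusing `connectionString` and `targetTableName` from your snippet), you can confirm whether SQLAlchemy can see the table's columns over that connection:

```python
import sqlalchemy as sa

# Hypothetical diagnostic, not part of the original report: reuse the same
# connection string and table name to see whether SQLAlchemy can reflect the
# table's columns directly.
engine = sa.create_engine(connectionString)
inspector = sa.inspect(engine)

print(inspector.get_table_names())             # tables visible in the default schema
print(inspector.get_columns(targetTableName))  # pass schema="..." if the table is not in the default schema
```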

For example, this is how you would connect: `context.data_sources.add_sql(name: str, connection_string: str)`
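And for reference, a minimal end-to-end sketch of the 1.0 SQL workflow (again reusing `connectionString` and `targetTableName` from your snippet; the asset and batch definition names are illustrative):

```python
import great_expectations as gx

context = gx.get_context(mode="file", project_root_dir="/dbfs/great_expectations/")

# 1.0-style SQL Data Source and Table Asset (add_sql replaces add_or_update_sql)
data_source = context.data_sources.add_sql(
    name="TargetSQLDataSource", connection_string=connectionString
)
table_asset = data_source.add_table_asset(name="ProductAsset", table_name=targetTableName)

# A whole-table Batch Definition replaces build_batch_request() / get_validator()
batch_definition = table_asset.add_batch_definition_whole_table(name="FULL_TABLE")
batch = batch_definition.get_batch()

# Validate an expectation directly against the batch
result = batch.validate(gx.expectations.ExpectColumnValuesToNotBeNull(column="ProductID"))
print(result.success)
```

This mirrors the flow on the SQL data page linked above; if column reflection still fails here, the problem is more likely in the connection or table/schema visibility than in the expectation code.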

Please provide the updated code.