great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

DataAssistant cannot handle empty values #8375

Open MarcelBeining opened 1 year ago

MarcelBeining commented 1 year ago

Describe the bug I wanted to use the onboarding Data Assistant for profiling my data in Kedro; however, most of the columns produce errors during profiling because they also contain None values. I am wondering why I am the first one to report this.

To Reproduce Run this code

import pandas as pd
import great_expectations as gx
ge_context = gx.data_context.DataContext()

result = ge_context.assistants.onboarding.run(
    batch_request={
        "runtime_parameters": {"batch_data": pd.DataFrame([{'a': True, 'b': 0}, {'a': False, 'b': 0}, {'b': 0}])},
        "data_connector_name": "default_runtime_data_connector_name",
        "datasource_name": "validation_datasource",
        "data_asset_name": "asset",
        "batch_identifiers": {
            "default_identifier_name": "default_identifier"
        }
    },
    exclude_column_names=[],
)

great_expectations.yml

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview/#2-configure-your-datasource
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 3.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/connect_to_data_overview
datasources:
  validation_datasource:
    module_name: great_expectations.datasource
    data_connectors:
      default_runtime_data_connector_name:
        module_name: great_expectations.datasource.data_connector
        class_name: RuntimeDataConnector
        batch_identifiers:
          - default_identifier_name
    class_name: Datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine

# This config file supports variable substitution which enables: 1) keeping
# secrets out of source control & 2) environment-based configuration changes
# such as staging vs prod.
#
# When GE encounters substitution syntax (like `my_key: ${my_value}` or
# `my_key: $my_value`) in the great_expectations.yml file, it will attempt
# to replace the value of `my_key` with the value from an environment
# variable `my_value` or a corresponding key read from this config file,
# which is defined through the `config_variables_file_path`.
# Environment variables take precedence over variables defined here.
#
# Substitution values defined here can be a simple (non-nested) value,
# nested value such as a dictionary, or an environment variable (i.e. ${ENV_VAR})
#
#
# https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_configure_credentials

config_variables_file_path: uncommitted/config_variables.yml

# The plugins_directory will be added to your python path for custom modules
# used to override and extend Great Expectations.
plugins_directory: plugins/

stores:
# Stores are configurable places to store things like Expectations, Validations
# Data Docs, and more. These are for advanced users only - most users can simply
# leave this section alone.
#
# Three stores are required: expectations, validations, and
# evaluation_parameters, and must exist with a valid store entry. Additional
# stores can be configured for uses such as data_docs, etc.
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/

  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/

expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store

data_docs_sites:
  # Data Docs make it simple to visualize data quality in your project. These
  # include Expectations, Validations & Profiles. The are built for all
  # Datasources from JSON artifacts in the local repo including validations &
  # profiles from the uncommitted directory. Read more at https://docs.greatexpectations.io/docs/terms/data_docs
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

anonymous_usage_statistics:
  data_context_id: 43f5a1f6-975d-4b98-a685-01d98ca9ecba
  enabled: true
notebooks:
include_rendered_content:
  expectation_validation_result: false
  expectation_suite: false
  globally: false

Error Traceback

File "...\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant_runner.py", line 172, in run
  data_assistant_result: DataAssistantResult = data_assistant.run(
File "...\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py", line 538, in run
  run_profiler_on_data(
File "...\lib\site-packages\great_expectations\util.py", line 223, in compute_delta_t
  return func(*args, **kwargs)
File "...\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py", line 726, in run_profiler_on_data
  rule_based_profiler_result: RuleBasedProfilerResult = profiler.run(
File "...\lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py", line 304, in usage_statistics_wrapped_method
  result = func(*args, **kwargs)
File "...\lib\site-packages\great_expectations\rule_based_profiler\rule_based_profiler.py", line 325, in run
  rule_state = rule.run(
File "...\lib\site-packages\great_expectations\util.py", line 223, in compute_delta_t
  return func(*args, **kwargs)
File "...\lib\site-packages\great_expectations\rule_based_profiler\rule\rule.py", line 167, in run
  expectation_configuration_builder.resolve_validation_dependencies(
File "...\lib\site-packages\great_expectations\rule_based_profiler\expectation_configuration_builder\expectation_configuration_builder.py", line 122, in resolve_validation_dependencies
  validation_parameter_builder.build_parameters(
File "...\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\parameter_builder.py", line 167, in build_parameters
  parameter_computation_result: Attributes = parameter_computation_impl(
File "...\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py", line 383, in _build_parameters
  self._estimate_metric_value_range(
File "...\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py", line 616, in _estimate_metric_value_range
  numeric_range_estimator.get_numeric_range_estimate(
File "...\lib\site-packages\great_expectations\rule_based_profiler\estimators\numeric_range_estimator.py", line 73, in get_numeric_range_estimate
  return self._get_numeric_range_estimate(
File "...\lib\site-packages\great_expectations\rule_based_profiler\estimators\exact_numeric_range_estimator.py", line 63, in _get_numeric_range_estimate
  min_value: MetricValue = np.amin(a=metric_values_converted)
File "<__array_function__ internals>", line 180, in amin
File "...\lib\site-packages\numpy\core\fromnumeric.py", line 2916, in amin
  return _wrapreduction(a, np.minimum, 'min', axis, None, out,
File "...\lib\site-packages\numpy\core\fromnumeric.py", line 86, in _wrapreduction
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity
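The final frames of the traceback show the actual failure mode: after the null values are filtered out, the profiler ends up calling `np.amin` on an empty array, and NumPy's `minimum` reduction has no identity element for zero-size input. A minimal sketch of that NumPy behavior (independent of Great Expectations), including the `initial=` keyword NumPy offers to supply an explicit identity:

```python
import numpy as np

# Reducing an empty array with `min` raises, because `np.minimum`
# has no identity element to return for zero-size input.
try:
    np.amin(np.array([]))
except ValueError as exc:
    print(type(exc).__name__)  # ValueError

# NumPy's `initial` keyword supplies an explicit identity, so the
# reduction of an empty array succeeds and returns that value.
print(np.amin(np.array([]), initial=np.inf))  # inf
```

This is only an illustration of the underlying NumPy error, not a claim about how Great Expectations should fix the estimator.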

Expected behavior Fill my expectation suite with valuable expectations about columns a and b

Environment (please complete the following information):

HaebichanGX commented 1 year ago

Hi @MarcelBeining, thank you for sharing this with us. This is the old, block-config-driven way of using the Data Assistant, which we no longer support, so issues are bound to surface. We have moved on to the Fluent Data Source style; please see the docs on it here: https://docs.greatexpectations.io/docs/guides/expectations/data_assistants/how_to_create_an_expectation_suite_with_the_onboarding_data_assistant/

MarcelBeining commented 11 months ago

Same error using the fluent version:

import great_expectations as ge
import pandas as pd

ge_context = ge.data_context.DataContext()
datasource = ge_context.sources.add_pandas(name="validation_datasource")
data_asset = datasource.add_dataframe_asset(name="asset")
batch_request = data_asset.build_batch_request(dataframe=pd.DataFrame([{'a': True, 'b': 0},
                                                                       {'a': False, 'b': 0},
                                                                       {'b': 0}]))
result = ge_context.assistants.onboarding.run(
    batch_request=batch_request,
    exclude_column_names=[],
)

ValueError: zero-size array to reduction operation minimum which has no identity
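Since the error is triggered by columns that contain nulls, one possible workaround (a sketch, not verified against every assistant rule) is to compute the null-bearing columns up front with pandas and pass them to `exclude_column_names` instead of the empty list used above:

```python
import pandas as pd

df = pd.DataFrame([{'a': True, 'b': 0},
                   {'a': False, 'b': 0},
                   {'b': 0}])

# Columns with at least one missing value; passing these to
# exclude_column_names skips the columns that trip the
# zero-size reduction in the assistant.
nullable_columns = [c for c in df.columns if df[c].isna().any()]
print(nullable_columns)  # ['a']
```

The obvious downside is that those columns then get no expectations at all, so this only sidesteps the bug rather than fixing it.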

The error does not seem to appear with the newest version, 0.17.9; however, I am forced to stay on version 0.16.5 as long as the problem from #8387 / #8392 persists :-/