great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Data Assistant fails for dates before the year 1970 #7481

Closed thoniTUB closed 1 year ago

thoniTUB commented 1 year ago

**Describe the bug**
GX's Data Assistant fails when it encounters a date outside the range described in the documentation of `datetime.datetime.utcfromtimestamp` (https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

> ... and OSError on gmtime() failure. It’s common for this to be restricted to years in 1970 through 2038.
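To make the failing input concrete, here is a minimal sketch (standard library only, not GX code) that computes the Unix timestamps of the two dates used in the reproduction below. It assumes the parsed dates are interpreted as UTC; the helper name `to_unix_timestamp` is made up for illustration.

```python
import datetime


def to_unix_timestamp(date_string: str) -> float:
    """Return seconds since 1970-01-01T00:00:00 UTC for an ISO date string."""
    parsed = datetime.datetime.fromisoformat(date_string)
    # Attach UTC explicitly (assumption: the column holds UTC dates), so the
    # result does not depend on the local time zone of the machine.
    aware = parsed.replace(tzinfo=datetime.timezone.utc)
    return aware.timestamp()


timestamps = {d: to_unix_timestamp(d) for d in ("1970-01-01", "1969-01-01")}
print(timestamps)
```

The 1969 date maps to a negative timestamp, which is the kind of value that `utcfromtimestamp` may reject on some platforms.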

**To Reproduce**
Steps to reproduce the behavior:

  1. Create a CSV file test.csv with this content:
    date
    1970-01-01
    1969-01-01
  2. Create a pandas DataSource
  3. Create a new Expectation Suite with "Automatically, using a Data Assistant"
  4. In the notebook
    1. modify the BatchRequest to parse the date-column as a date:
        batch_request = {
            'datasource_name': 'my_datasource',
            'data_connector_name': 'default_inferred_data_connector_name',
            'data_asset_name': 'test.csv',
            'limit': 1000,
            "batch_spec_passthrough": {
                "reader_options": {
                    "parse_dates": [
                        "date"
                    ]
                }
            }
        }
    2. Remove `date` from the column exclusion list (`exclude_column_names`)
  5. Run the notebook until it fails in the OnboardingDataAssistant cell with:
    
    OSError                                   Traceback (most recent call last)
    Cell In[5], line 1
    ----> 1 result = context.assistants.onboarding.run(
      2     batch_request=batch_request,
      3     exclude_column_names=exclude_column_names,
      4 )
      5 validator.expectation_suite = result.get_expectation_suite(
      6     expectation_suite_name=expectation_suite_name
      7 )

    File <string>:2, in run(batch_request, estimation, include_column_names, exclude_column_names, include_column_name_suffixes, exclude_column_name_suffixes, semantic_type_filter_module_name, semantic_type_filter_class_name, max_unexpected_values, max_unexpected_ratio, min_max_unexpected_values_proportion, allowed_semantic_types_passthrough, cardinality_limit_mode, max_unique_values, max_proportion_unique, table_rule, column_value_uniqueness_rule, column_value_nullity_rule, column_value_nonnullity_rule, numeric_columns_rule, datetime_columns_rule, text_columns_rule, categorical_columns_rule)

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant_runner.py:176, in DataAssistantRunner.run_impl.<locals>.run(batch_request, estimation, **kwargs)
        166 variables_directives_list: List[
        167     RuntimeEnvironmentVariablesDirectives
        168 ] = build_variables_directives(
       (...)
        171     **variables_directives_kwargs,
        172 )
        173 domain_type_directives_list: List[
        174     RuntimeEnvironmentDomainTypeDirectives
        175 ] = build_domain_type_directives(**domain_type_directives_kwargs)
    --> 176 data_assistant_result: DataAssistantResult = data_assistant.run(
        177     variables_directives_list=variables_directives_list,
        178     domain_type_directives_list=domain_type_directives_list,
        179 )
        180 return data_assistant_result

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py:571, in DataAssistant.run(self, variables, rules, variables_directives_list, domain_type_directives_list)
        565 batches: Dict[str, Union[Batch, FluentBatch]] = self._batches or {}
        567 data_assistant_result = DataAssistantResult(
        568     _batch_id_to_batch_identifier_display_name_map=self._batch_id_to_batch_identifier_display_name_map(),
        569     _usage_statistics_handler=usage_statistics_handler,
        570 )
    --> 571 run_profiler_on_data(
        572     data_assistant=self,
        573     data_assistant_result=data_assistant_result,
        574     profiler=self._profiler,
        575     variables=variables,
        576     rules=rules,
        577     batch_list=list(batches.values()),
        578     batch_request=None,
        579     variables_directives_list=variables_directives_list,
        580     domain_type_directives_list=domain_type_directives_list,
        581 )
        582 return self._build_data_assistant_result(
        583     data_assistant_result=data_assistant_result
        584 )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:224, in measure_execution_time.<locals>.execution_time_decorator.<locals>.compute_delta_t(*args, **kwargs)
        222 time_begin: float = (getattr(time, method))()
        223 try:
    --> 224     return func(*args, **kwargs)
        225 finally:
        226     time_end: float = (getattr(time, method))()

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py:756, in run_profiler_on_data(data_assistant, data_assistant_result, profiler, variables, rules, batch_list, batch_request, variables_directives_list, domain_type_directives_list)
        739 """
        740 This method executes "run()" of effective "RuleBasedProfiler" and fills "DataAssistantResult" object with outputs.
        741
       (...)
        751     domain_type_directives_list: additional/override runtime domain directives (modify "BaseRuleBasedProfiler")
        752 """
        753 comment: str = f"""Created by effective Rule-Based Profiler of {data_assistant.__class__.__name__} with the \
        754 configuration included.
        755 """
    --> 756 rule_based_profiler_result: RuleBasedProfilerResult = profiler.run(
        757     variables=variables,
        758     rules=rules,
        759     batch_list=batch_list,
        760     batch_request=batch_request,
        761     runtime_configuration=None,
        762     reconciliation_directives=DEFAULT_RECONCILATION_DIRECTIVES,
        763     variables_directives_list=variables_directives_list,
        764     domain_type_directives_list=domain_type_directives_list,
        765     comment=comment,
        766 )
        767 result: DataAssistantResult = data_assistant_result
        768 result.profiler_config = profiler.config

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py:318, in usage_statistics_enabled_method.<locals>.usage_statistics_wrapped_method(*args, **kwargs)
        315     args_payload = args_payload_fn(*args, **kwargs) or {}
        316     nested_update(event_payload, args_payload)
    --> 318 result = func(*args, **kwargs)
        319 message["success"] = True
        320 except Exception:

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\rule_based_profiler.py:327, in BaseRuleBasedProfiler.run(self, variables, rules, batch_list, batch_request, runtime_configuration, reconciliation_directives, variables_directives_list, domain_type_directives_list, comment)
        318 rule: Rule
        319 for rule in pbar_method(
        320     effective_rules,
        321     desc="Generating Expectations:",
       (...)
        325     bar_format="{desc:25}{percentage:3.0f}%|{bar}{r_bar}",
        326 ):
    --> 327     rule_state = rule.run(
        328         variables=effective_variables,
        329         batch_list=batch_list,
        330         batch_request=batch_request,
        331         runtime_configuration=runtime_configuration,
        332         reconciliation_directives=reconciliation_directives,
        333         rule_state=RuleState(),
        334     )
        335     self.rule_states.append(rule_state)
        337 return RuleBasedProfilerResult(
        338     fully_qualified_parameter_names_by_domain=self.get_fully_qualified_parameter_names_by_domain(),
        339     parameter_values_for_fully_qualified_parameter_names_by_domain=self.get_parameter_values_for_fully_qualified_parameter_names_by_domain(),
       (...)
        365     _usage_statistics_handler=self._usage_statistics_handler,
        366 )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:224, in measure_execution_time.<locals>.execution_time_decorator.<locals>.compute_delta_t(*args, **kwargs)
        222 time_begin: float = (getattr(time, method))()
        223 try:
    --> 224     return func(*args, **kwargs)
        225 finally:
        226     time_end: float = (getattr(time, method))()

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\rule\rule.py:175, in Rule.run(self, variables, batch_list, batch_request, runtime_configuration, reconciliation_directives, rule_state)
        172 expectation_configuration_builder: ExpectationConfigurationBuilder
        174 for expectation_configuration_builder in expectation_configuration_builders:
    --> 175     expectation_configuration_builder.resolve_validation_dependencies(
        176         domain=domain,
        177         variables=variables,
        178         parameters=rule_state.parameters,
        179         batch_list=batch_list,
        180         batch_request=batch_request,
        181         runtime_configuration=runtime_configuration,
        182     )
        184 return rule_state

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\expectation_configuration_builder\expectation_configuration_builder.py:129, in ExpectationConfigurationBuilder.resolve_validation_dependencies(self, domain, variables, parameters, batch_list, batch_request, runtime_configuration)
        127 validation_parameter_builder: ParameterBuilder
        128 for validation_parameter_builder in validation_parameter_builders:
    --> 129     validation_parameter_builder.build_parameters(
        130         domain=domain,
        131         variables=variables,
        132         parameters=parameters,
        133         parameter_computation_impl=None,
        134         batch_list=batch_list,
        135         batch_request=batch_request,
        136         runtime_configuration=runtime_configuration,
        137     )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\parameter_builder.py:191, in ParameterBuilder.build_parameters(self, domain, variables, parameters, parameter_computation_impl, batch_list, batch_request, runtime_configuration)
        188 if parameter_computation_impl is None:
        189     parameter_computation_impl = self._build_parameters
    --> 191 parameter_computation_result: Attributes = parameter_computation_impl(
        192     domain=domain,
        193     variables=variables,
        194     parameters=parameters,
        195     runtime_configuration=runtime_configuration,
        196 )
        198 parameter_values: Dict[str, Any] = {
        199     self.raw_fully_qualified_parameter_name: parameter_computation_result,
        200     self.json_serialized_fully_qualified_parameter_name: convert_to_json_serializable(
        201         data=parameter_computation_result
        202     ),
        203 }
        205 build_parameter_container(
        206     parameter_container=parameters[domain.id],
        207     parameter_values=parameter_values,
        208 )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py:402, in NumericMetricRangeMultiBatchParameterBuilder._build_parameters(self, domain, variables, parameters, runtime_configuration)
        386 round_decimals = self._get_round_decimals_using_heuristics(
        387     metric_values=metric_values,
        388     domain=domain,
        389     variables=variables,
        390     parameters=parameters,
        391 )
        393 numeric_range_estimator: NumericRangeEstimator = (
        394     self._build_numeric_range_estimator(
        395         round_decimals=round_decimals,
       (...)
        399     )
        400 )
        401 numeric_range_estimation_result: NumericRangeEstimationResult = (
    --> 402     self._estimate_metric_value_range(
        403         metric_values=metric_values,
        404         numeric_range_estimator=numeric_range_estimator,
        405         round_decimals=round_decimals,
        406         domain=domain,
        407         variables=variables,
        408         parameters=parameters,
        409     )
        410 )
        412 value_range: np.ndarray = numeric_range_estimation_result.value_range
        413 details: Dict[str, Any] = copy.deepcopy(
        414     parameter_node[FULLY_QUALIFIED_PARAMETER_NAME_METADATA_KEY]
        415 )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py:634, in NumericMetricRangeMultiBatchParameterBuilder._estimate_metric_value_range(self, metric_values, numeric_range_estimator, round_decimals, domain, variables, parameters)
        626     numeric_range_estimation_result = build_numeric_range_estimation_result(
        627         metric_values=metric_value_vector,
        628         min_value=metric_value_vector[0],
        629         max_value=metric_value_vector[0],
        630     )
        631 else:
        632     # Compute low and high estimates for vector of samples for given element of multi-dimensional metric.
        633     numeric_range_estimation_result = (
    --> 634         numeric_range_estimator.get_numeric_range_estimate(
        635             metric_values=metric_value_vector,
        636             domain=domain,
        637             variables=variables,
        638             parameters=parameters,
        639         )
        640     )
        642 min_value = numeric_range_estimation_result.value_range[0]
        643 if lower_bound is not None:

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\estimators\numeric_range_estimator.py:73, in NumericRangeEstimator.get_numeric_range_estimate(self, metric_values, domain, variables, parameters)
         55 def get_numeric_range_estimate(
         56     self,
         57     metric_values: np.ndarray,
       (...)
         60     parameters: Optional[Dict[str, ParameterContainer]] = None,
         61 ) -> NumericRangeEstimationResult:
         62     """
         63     Method that invokes implementation of the estimation algorithm that is the subject of the inherited class.
         64     Args:
       (...)
         71         "NumericRangeEstimationResult" object, containing computed "value_range" and "estimation_histogram" details.
         72     """
    ---> 73     return self._get_numeric_range_estimate(
         74         metric_values=metric_values,
         75         domain=domain,
         76         variables=variables,
         77         parameters=parameters,
         78     )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\estimators\exact_numeric_range_estimator.py:67, in ExactNumericRangeEstimator._get_numeric_range_estimate(self, metric_values, domain, variables, parameters)
         65 min_value: Number = np.amin(a=metric_values_converted)
         66 max_value: Number = np.amax(a=metric_values_converted)
    ---> 67 return build_numeric_range_estimation_result(
         68     metric_values=metric_values_converted,
         69     min_value=min_value,
         70     max_value=max_value,
         71 )

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\helpers\util.py:881, in build_numeric_range_estimation_result(metric_values, min_value, max_value)
        879     histogram = np.histogram(a=metric_values_converted, bins=NUM_HISTOGRAM_BINS)
        880     # Use "UTC" TimeZone normalization in "bin_edges" when "metric_values" consists of "datetime.datetime" objects.
    --> 881     bin_edges = convert_ndarray_float_to_datetime_dtype(data=histogram[1])
        882 else:
        883     histogram = np.histogram(a=metric_values, bins=NUM_HISTOGRAM_BINS)

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:1699, in convert_ndarray_float_to_datetime_dtype(data)
       1693 """
       1694 Convert all elements of 1-D "np.ndarray" argument from "float" type to "datetime.datetime" type objects.
       1695
       1696 Note: Converts to "naive" "datetime.datetime" values (assumes "UTC" TimeZone based floating point timestamps).
       1697 """
       1698 value: Any
    -> 1699 return np.asarray([datetime.datetime.utcfromtimestamp(value) for value in data])

    File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:1699, in <listcomp>(.0)
       1693 """
       1694 Convert all elements of 1-D "np.ndarray" argument from "float" type to "datetime.datetime" type objects.
       1695
       1696 Note: Converts to "naive" "datetime.datetime" values (assumes "UTC" TimeZone based floating point timestamps).
       1697 """
       1698 value: Any
    -> 1699 return np.asarray([datetime.datetime.utcfromtimestamp(value) for value in data])

    OSError: [Errno 22] Invalid argument



**Expected behavior**
Any valid date can be processed by the Data Assistant.

**Environment:**
 - Operating System: Windows
 - Great Expectations Version: 0.16.3

**Additional context**
As indicated in the documentation of `datetime.datetime.utcfromtimestamp`, this problem is platform-dependent. I briefly tried to find an "upper" date where this breaks, but had no problems there.

talagluck commented 1 year ago

Hi @thoniTUB - thanks for raising! We will review and be in touch.

rreinoldsc commented 1 year ago

Hi @thoniTUB,

I've triaged this and found the code and data that I believe are causing this exception. Could you run the following and let me know whether it triggers an OSError:

import datetime
# Unix time representation of data before 1970
data1 = [-31536000.5, -31536000.4, -31536000.3, -31536000.2, -31536000.1, -31536000., -31535999.9, -31535999.8, -31535999.7, -31535999.6, -31535999.5]
[datetime.datetime.utcfromtimestamp(value) for value in data1]

With that verification, I can continue to move towards a fix.
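A guarded variant of the same check may be handy when comparing machines, since it reports the outcome instead of aborting on the first failure. This is only a sketch: the function name `platform_accepts_timestamp` is hypothetical, and the set of caught exceptions follows what the Python documentation lists for `utcfromtimestamp`.

```python
import datetime
import warnings


def platform_accepts_timestamp(value: float) -> bool:
    """Return True if utcfromtimestamp() can convert `value` on this platform."""
    try:
        with warnings.catch_warnings():
            # utcfromtimestamp() is deprecated since Python 3.12; silence the
            # warning so the probe behaves the same on older and newer versions.
            warnings.simplefilter("ignore", DeprecationWarning)
            datetime.datetime.utcfromtimestamp(value)
    except (OverflowError, OSError, ValueError):
        return False
    return True


# Probe the epoch itself and one timestamp from the list above.
for value in (0.0, -31536000.0):
    print(value, platform_accepts_timestamp(value))
```

The result for the negative value is expected to differ between platforms, which is the point of the probe.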

thoniTUB commented 1 year ago

Hi @rreinoldsc,

Thank you for looking into the problem. Your code snippet triggered the OSError:

$ python test.py
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    [datetime.datetime.utcfromtimestamp(value) for value in data1]
  File "test.py", line 4, in <listcomp>
    [datetime.datetime.utcfromtimestamp(value) for value in data1]
OSError: [Errno 22] Invalid argument

rreinoldsc commented 1 year ago

Thank you @thoniTUB that confirmation is helpful.

I suspect the following approach will fix the issue, but first, could you confirm that it runs successfully:

docker run python:3.9 python -c 'import datetime; timestamp = -31536000; output = datetime.datetime.utcfromtimestamp(timestamp) if timestamp > 0 else datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=timestamp); print(output)'
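Written out as a function, the arithmetic branch of that one-liner looks roughly like the sketch below. It is not necessarily the change that went into GX; the names `EPOCH` and `utc_datetime_from_timestamp` are made up here. The idea is to add a `timedelta` to the epoch, which is plain Python arithmetic rather than a call into the platform's time functions.

```python
import datetime

# Naive datetime representing 1970-01-01T00:00:00 UTC.
EPOCH = datetime.datetime(1970, 1, 1)


def utc_datetime_from_timestamp(timestamp: float) -> datetime.datetime:
    """Convert a Unix timestamp (seconds, may be negative) to a naive UTC datetime."""
    return EPOCH + datetime.timedelta(seconds=timestamp)


print(utc_datetime_from_timestamp(-31536000))
```

Because the same expression also handles non-negative timestamps, the conditional from the one-liner is not strictly needed in this form.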

Also, to better understand your environment:

Thanks, Rob

Shinnnyshinshin commented 1 year ago

Hi @thoniTUB

I had a look at this issue again, and it appears that we have addressed the issue in the process of refactoring our code into the "fluent" interface, which greatly simplifies the configuration required for connecting to data and creating assets to validate.

import great_expectations as gx
import os
# create context
context = gx.get_context()

datasource_name = "my_datasource"
asset_name = "date_data"
path_to_data = "data.txt"  # which contains the data you describe above.

datasource = context.sources.add_or_update_pandas(datasource_name)

# here is where you define the `date` column to be parsed as a datetime object. 
asset = datasource.add_csv_asset(
        asset_name, filepath_or_buffer=path_to_data, parse_dates=["date"])

batch_request = asset.build_batch_request()

data_assistant_result = context.assistants.onboarding.run(
    batch_request=batch_request,
    #exclude_column_name_suffixes=["date"],
)

data_assistant_result.show_expectations_by_expectation_type()

In the output you will see the following configurations for `expect_column_max_to_be_between`, `expect_column_min_to_be_between` and `expect_column_values_to_be_between`, which appear when the `date` column has been parsed correctly.

  { 'expect_column_max_to_be_between': { 'column': 'date',
                                         'domain': 'column',
                                         'max_value': '1970-01-01T00:00:00',
                                         'min_value': '1970-01-01T00:00:00',
                                         'strict_max': False,
                                         'strict_min': False}},
  { 'expect_column_min_to_be_between': { 'column': 'date',
                                         'domain': 'column',
                                         'max_value': '1969-01-01T00:00:00',
                                         'min_value': '1969-01-01T00:00:00',
                                         'strict_max': False,
                                         'strict_min': False}},

  { 'expect_column_values_to_be_between': { 'column': 'date',
                                            'domain': 'column',
                                            'max_value': '1970-01-01T00:00:00',
                                            'min_value': '1969-01-01T00:00:00',
                                            'mostly': 1.0,
                                            'strict_max': False,
                                            'strict_min': False}},
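As a quick sanity check of that output, the reported bounds can be parsed back into `datetime` objects. The sketch below assumes the expectation arguments are available as a plain dict shaped like the printed configuration; the variable names are illustrative only.

```python
import datetime

# Values copied from the expect_column_values_to_be_between entry above.
values_between = {
    "column": "date",
    "min_value": "1969-01-01T00:00:00",
    "max_value": "1970-01-01T00:00:00",
}

epoch = datetime.datetime(1970, 1, 1)
min_value = datetime.datetime.fromisoformat(values_between["min_value"])
max_value = datetime.datetime.fromisoformat(values_between["max_value"])

# The lower bound lies before the epoch, so the pre-1970 date was preserved.
print(min_value < epoch, min_value <= max_value)
```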

**Environment:**
 - Operating System: Mac OS
 - Great Expectations Version: 0.17.15