Closed thoniTUB closed 1 year ago
Hi @thoniTUB - thanks for raising! We will review and be in touch.
Hi @thoniTUB,
I've triaged this and found the code and data that I believe are causing the exception. Could you run the following and let me know whether it triggers an OSError:
import datetime
# Unix time representation of data before 1970
data1 = [-31536000.5, -31536000.4, -31536000.3, -31536000.2, -31536000.1, -31536000., -31535999.9, -31535999.8, -31535999.7, -31535999.6, -31535999.5]
[datetime.datetime.utcfromtimestamp(value) for value in data1]
With that verification, I can continue to move towards a fix.
Hi @rreinoldsc ,
Thank you for looking into the problem. Your code snippet triggered the OSError:
$ python test.py
Traceback (most recent call last):
File "test.py", line 4, in <module>
[datetime.datetime.utcfromtimestamp(value) for value in data1]
File "test.py", line 4, in <listcomp>
[datetime.datetime.utcfromtimestamp(value) for value in data1]
OSError: [Errno 22] Invalid argument
Thank you @thoniTUB, that confirmation is helpful.
I suspect this will fix the issue, but first, could you confirm that this runs successfully:
docker run python:3.9 python -c 'import datetime; timestamp=-31536000; output = datetime.datetime.utcfromtimestamp(timestamp) if timestamp > 0 else datetime.datetime(1970, 1, 1) + datetime.timedelta(seconds=timestamp); print(output)'
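The one-liner above falls back to epoch arithmetic for negative timestamps. As a minimal sketch of the same idea, the conversion could be wrapped in a small helper. The name `utc_datetime_from_timestamp` is made up for illustration and is not part of Great Expectations; the sample values are a subset of the reproduction data earlier in this thread.

```python
import datetime

# Naive datetime representing the Unix epoch in UTC.
EPOCH = datetime.datetime(1970, 1, 1)


def utc_datetime_from_timestamp(timestamp: float) -> datetime.datetime:
    """Return a naive UTC datetime for a POSIX timestamp.

    Uses epoch + timedelta arithmetic instead of the platform's gmtime(),
    so negative (pre-1970) timestamps are handled the same way everywhere.
    """
    return EPOCH + datetime.timedelta(seconds=timestamp)


# A few of the pre-1970 values from the reproduction snippet.
sample = [-31536000.5, -31536000.0, -31535999.5]
for value in sample:
    print(value, "->", utc_datetime_from_timestamp(value))
```

This always takes the arithmetic path rather than branching on the sign, which keeps the behavior identical across platforms at the cost of never calling the C library.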
Also, to better understand your environment:
docker run python:3.9 lscpu
Thanks, Rob
Hi @thoniTUB
I had a look at this issue again, and it appears that we have addressed the issue in the process of refactoring our code into the "fluent" interface, which greatly simplifies the configuration required for connecting to data and creating assets to validate.
import great_expectations as gx
import os
# create context
context = gx.get_context()
datasource_name = "my_datasource"
asset_name = "date_data"
path_to_data = "data.txt"  # which contains the data you describe above.
datasource = context.sources.add_or_update_pandas(datasource_name)
# here is where you define the `date` column to be parsed as a datetime object.
asset = datasource.add_csv_asset(
    asset_name, filepath_or_buffer=path_to_data, parse_dates=["date"])
batch_request = asset.build_batch_request()
data_assistant_result = context.assistants.onboarding.run(
    batch_request=batch_request,
    #exclude_column_name_suffixes=["date"],
)
data_assistant_result.show_expectations_by_expectation_type()
In the output you will see the following configurations for expect_column_max_to_be_between, expect_column_min_to_be_between, and expect_column_values_to_be_between, which appear when the date column is parsed correctly.
{ 'expect_column_max_to_be_between': { 'column': 'date',
                                       'domain': 'column',
                                       'max_value': '1970-01-01T00:00:00',
                                       'min_value': '1970-01-01T00:00:00',
                                       'strict_max': False,
                                       'strict_min': False}},
{ 'expect_column_min_to_be_between': { 'column': 'date',
                                       'domain': 'column',
                                       'max_value': '1969-01-01T00:00:00',
                                       'min_value': '1969-01-01T00:00:00',
                                       'strict_max': False,
                                       'strict_min': False}},
{ 'expect_column_values_to_be_between': { 'column': 'date',
                                          'domain': 'column',
                                          'max_value': '1970-01-01T00:00:00',
                                          'min_value': '1969-01-01T00:00:00',
                                          'mostly': 1.0,
                                          'strict_max': False,
                                          'strict_min': False}},
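To relate these bounds to the numeric timestamps used earlier in the thread, here is a small worked check in plain Python. It only uses the two ISO strings from the output above; the helper name `seconds_since_epoch` is an illustrative assumption, not a GX function.

```python
import datetime

# Naive datetime representing the Unix epoch in UTC.
EPOCH = datetime.datetime(1970, 1, 1)

# The bounds reported in the expectation configurations above.
min_value = datetime.datetime.fromisoformat("1969-01-01T00:00:00")
max_value = datetime.datetime.fromisoformat("1970-01-01T00:00:00")


def seconds_since_epoch(value: datetime.datetime) -> float:
    """Offset of a naive UTC datetime from the epoch, in seconds."""
    return (value - EPOCH).total_seconds()


# 1969 has 365 days, so the lower bound sits 365 * 86400 seconds before the epoch.
print(seconds_since_epoch(min_value))  # -31536000.0
print(seconds_since_epoch(max_value))  # 0.0
```

The lower bound is therefore the same -31536000 value that appears in the reproduction snippet and in the docker command.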
Environment:
Operating System: Mac OS
Great Expectations Version: 0.17.15
Describe the bug
GX's Data Assistant fails when a date outside the range described by https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp is encountered.
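Whether a given timestamp is in range depends on the platform's C library, so the same value can convert on one machine and fail on another. A minimal probe, assuming only the standard library (the function name `try_utcfromtimestamp` is hypothetical), could look like this:

```python
import datetime
import warnings
from typing import Optional


def try_utcfromtimestamp(timestamp: float) -> Optional[datetime.datetime]:
    """Return the converted datetime, or None if this platform rejects the value."""
    try:
        with warnings.catch_warnings():
            # utcfromtimestamp is deprecated since Python 3.12; silence that here
            # because the point is to exercise the exact call from the traceback.
            warnings.simplefilter("ignore", DeprecationWarning)
            return datetime.datetime.utcfromtimestamp(timestamp)
    except (OSError, OverflowError, ValueError):
        # The documentation lists OverflowError and OSError as possible failures.
        return None


for ts in (0.0, -1.0, -31536000.0):
    result = try_utcfromtimestamp(ts)
    print(ts, "->", result if result is not None else "rejected on this platform")
```

Running it in the affected environment shows which of the sample values are rejected there.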
To Reproduce
Steps to reproduce the behavior:
test.csv
with this content:date
-column as a date:date
from the column exclusion list
File:2, in run(batch_request, estimation, include_column_names, exclude_column_names, include_column_name_suffixes, exclude_column_name_suffixes, semantic_type_filter_module_name, semantic_type_filter_class_name, max_unexpected_values, max_unexpected_ratio, min_max_unexpected_values_proportion, allowed_semantic_types_passthrough, cardinality_limit_mode, max_unique_values, max_proportion_unique, table_rule, column_value_uniqueness_rule, column_value_nullity_rule, column_value_nonnullity_rule, numeric_columns_rule, datetime_columns_rule, text_columns_rule, categorical_columns_rule)
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant_runner.py:176, in DataAssistantRunner.run_impl.<locals>.run(batch_request, estimation, **kwargs)
166 variables_directives_list: List[
167 RuntimeEnvironmentVariablesDirectives
168 ] = build_variables_directives(
(...)
171 variables_directives_kwargs,
172 )
173 domain_type_directives_list: List[
174 RuntimeEnvironmentDomainTypeDirectives
175 ] = build_domain_type_directives(**domain_type_directives_kwargs)
--> 176 data_assistant_result: DataAssistantResult = data_assistant.run(
177 variables_directives_list=variables_directives_list,
178 domain_type_directives_list=domain_type_directives_list,
179 )
180 return data_assistant_result
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py:571, in DataAssistant.run(self, variables, rules, variables_directives_list, domain_type_directives_list)
565 batches: Dict[str, Union[Batch, FluentBatch]] = self._batches or {}
567 data_assistant_result = DataAssistantResult(
568 _batch_id_to_batch_identifier_display_name_map=self._batch_id_to_batch_identifier_display_name_map(),
569 _usage_statistics_handler=usage_statistics_handler,
570 )
--> 571 run_profiler_on_data(
572 data_assistant=self,
573 data_assistant_result=data_assistant_result,
574 profiler=self._profiler,
575 variables=variables,
576 rules=rules,
577 batch_list=list(batches.values()),
578 batch_request=None,
579 variables_directives_list=variables_directives_list,
580 domain_type_directives_list=domain_type_directives_list,
581 )
582 return self._build_data_assistant_result(
583 data_assistant_result=data_assistant_result
584 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:224, in measure_execution_time.<locals>.execution_time_decorator.<locals>.compute_delta_t(*args, **kwargs)
222 time_begin: float = (getattr(time, method))()
223 try:
--> 224 return func(*args, **kwargs)
225 finally:
226 time_end: float = (getattr(time, method))()
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\data_assistant\data_assistant.py:756, in run_profiler_on_data(data_assistant, data_assistant_result, profiler, variables, rules, batch_list, batch_request, variables_directives_list, domain_type_directives_list)
739 """
740 This method executes "run()" of effective "RuleBasedProfiler" and fills "DataAssistantResult" object with outputs.
741
(...)
751 domain_type_directives_list: additional/override runtime domain directives (modify "BaseRuleBasedProfiler")
752 """
753 comment: str = f"""Created by effective Rule-Based Profiler of {data_assistant.__class__.__name__} with the \
754 configuration included.
755 """
--> 756 rule_based_profiler_result: RuleBasedProfilerResult = profiler.run(
757 variables=variables,
758 rules=rules,
759 batch_list=batch_list,
760 batch_request=batch_request,
761 runtime_configuration=None,
762 reconciliation_directives=DEFAULT_RECONCILATION_DIRECTIVES,
763 variables_directives_list=variables_directives_list,
764 domain_type_directives_list=domain_type_directives_list,
765 comment=comment,
766 )
767 result: DataAssistantResult = data_assistant_result
768 result.profiler_config = profiler.config
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\core\usage_statistics\usage_statistics.py:318, in usage_statistics_enabled_method.<locals>.usage_statistics_wrapped_method(*args, **kwargs)
315 args_payload = args_payload_fn(*args, **kwargs) or {}
316 nested_update(event_payload, args_payload)
--> 318 result = func(*args, **kwargs)
319 message["success"] = True
320 except Exception:
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\rule_based_profiler.py:327, in BaseRuleBasedProfiler.run(self, variables, rules, batch_list, batch_request, runtime_configuration, reconciliation_directives, variables_directives_list, domain_type_directives_list, comment)
318 rule: Rule
319 for rule in pbar_method(
320 effective_rules,
321 desc="Generating Expectations:",
(...)
325 bar_format="{desc:25}{percentage:3.0f}%|{bar}{r_bar}",
326 ):
--> 327 rule_state = rule.run(
328 variables=effective_variables,
329 batch_list=batch_list,
330 batch_request=batch_request,
331 runtime_configuration=runtime_configuration,
332 reconciliation_directives=reconciliation_directives,
333 rule_state=RuleState(),
334 )
335 self.rule_states.append(rule_state)
337 return RuleBasedProfilerResult(
338 fully_qualified_parameter_names_by_domain=self.get_fully_qualified_parameter_names_by_domain(),
339 parameter_values_for_fully_qualified_parameter_names_by_domain=self.get_parameter_values_for_fully_qualified_parameter_names_by_domain(),
(...)
365 _usage_statistics_handler=self._usage_statistics_handler,
366 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:224, in measure_execution_time.<locals>.execution_time_decorator.<locals>.compute_delta_t(*args, **kwargs)
222 time_begin: float = (getattr(time, method))()
223 try:
--> 224 return func(*args, **kwargs)
225 finally:
226 time_end: float = (getattr(time, method))()
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\rule\rule.py:175, in Rule.run(self, variables, batch_list, batch_request, runtime_configuration, reconciliation_directives, rule_state)
172 expectation_configuration_builder: ExpectationConfigurationBuilder
174 for expectation_configuration_builder in expectation_configuration_builders:
--> 175 expectation_configuration_builder.resolve_validation_dependencies(
176 domain=domain,
177 variables=variables,
178 parameters=rule_state.parameters,
179 batch_list=batch_list,
180 batch_request=batch_request,
181 runtime_configuration=runtime_configuration,
182 )
184 return rule_state
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\expectation_configuration_builder\expectation_configuration_builder.py:129, in ExpectationConfigurationBuilder.resolve_validation_dependencies(self, domain, variables, parameters, batch_list, batch_request, runtime_configuration)
127 validation_parameter_builder: ParameterBuilder
128 for validation_parameter_builder in validation_parameter_builders:
--> 129 validation_parameter_builder.build_parameters(
130 domain=domain,
131 variables=variables,
132 parameters=parameters,
133 parameter_computation_impl=None,
134 batch_list=batch_list,
135 batch_request=batch_request,
136 runtime_configuration=runtime_configuration,
137 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\parameter_builder.py:191, in ParameterBuilder.build_parameters(self, domain, variables, parameters, parameter_computation_impl, batch_list, batch_request, runtime_configuration)
188 if parameter_computation_impl is None:
189 parameter_computation_impl = self._build_parameters
--> 191 parameter_computation_result: Attributes = parameter_computation_impl(
192 domain=domain,
193 variables=variables,
194 parameters=parameters,
195 runtime_configuration=runtime_configuration,
196 )
198 parameter_values: Dict[str, Any] = {
199 self.raw_fully_qualified_parameter_name: parameter_computation_result,
200 self.json_serialized_fully_qualified_parameter_name: convert_to_json_serializable(
201 data=parameter_computation_result
202 ),
203 }
205 build_parameter_container(
206 parameter_container=parameters[domain.id],
207 parameter_values=parameter_values,
208 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py:402, in NumericMetricRangeMultiBatchParameterBuilder._build_parameters(self, domain, variables, parameters, runtime_configuration)
386 round_decimals = self._get_round_decimals_using_heuristics(
387 metric_values=metric_values,
388 domain=domain,
389 variables=variables,
390 parameters=parameters,
391 )
393 numeric_range_estimator: NumericRangeEstimator = (
394 self._build_numeric_range_estimator(
395 round_decimals=round_decimals,
(...)
399 )
400 )
401 numeric_range_estimation_result: NumericRangeEstimationResult = (
--> 402 self._estimate_metric_value_range(
403 metric_values=metric_values,
404 numeric_range_estimator=numeric_range_estimator,
405 round_decimals=round_decimals,
406 domain=domain,
407 variables=variables,
408 parameters=parameters,
409 )
410 )
412 value_range: np.ndarray = numeric_range_estimation_result.value_range
413 details: Dict[str, Any] = copy.deepcopy(
414 parameter_node[FULLY_QUALIFIED_PARAMETER_NAME_METADATA_KEY]
415 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\parameter_builder\numeric_metric_range_multi_batch_parameter_builder.py:634, in NumericMetricRangeMultiBatchParameterBuilder._estimate_metric_value_range(self, metric_values, numeric_range_estimator, round_decimals, domain, variables, parameters)
626 numeric_range_estimation_result = build_numeric_range_estimation_result(
627 metric_values=metric_value_vector,
628 min_value=metric_value_vector[0],
629 max_value=metric_value_vector[0],
630 )
631 else:
632 # Compute low and high estimates for vector of samples for given element of multi-dimensional metric.
633 numeric_range_estimation_result = (
--> 634 numeric_range_estimator.get_numeric_range_estimate(
635 metric_values=metric_value_vector,
636 domain=domain,
637 variables=variables,
638 parameters=parameters,
639 )
640 )
642 min_value = numeric_range_estimation_result.value_range[0]
643 if lower_bound is not None:
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\estimators\numeric_range_estimator.py:73, in NumericRangeEstimator.get_numeric_range_estimate(self, metric_values, domain, variables, parameters)
55 def get_numeric_range_estimate(
56 self,
57 metric_values: np.ndarray,
(...)
60 parameters: Optional[Dict[str, ParameterContainer]] = None,
61 ) -> NumericRangeEstimationResult:
62 """
63 Method that invokes implementation of the estimation algorithm that is the subject of the inherited class.
64 Args:
(...)
71 "NumericRangeEstimationResult" object, containing computed "value_range" and "estimation_histogram" details.
72 """
---> 73 return self._get_numeric_range_estimate(
74 metric_values=metric_values,
75 domain=domain,
76 variables=variables,
77 parameters=parameters,
78 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\estimators\exact_numeric_range_estimator.py:67, in ExactNumericRangeEstimator._get_numeric_range_estimate(self, metric_values, domain, variables, parameters)
65 min_value: Number = np.amin(a=metric_values_converted)
66 max_value: Number = np.amax(a=metric_values_converted)
---> 67 return build_numeric_range_estimation_result(
68 metric_values=metric_values_converted,
69 min_value=min_value,
70 max_value=max_value,
71 )
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\rule_based_profiler\helpers\util.py:881, in build_numeric_range_estimation_result(metric_values, min_value, max_value)
879 histogram = np.histogram(a=metric_values_converted, bins=NUM_HISTOGRAM_BINS)
880 # Use "UTC" TimeZone normalization in "bin_edges" when "metric_values" consists of "datetime.datetime" objects.
--> 881 bin_edges = convert_ndarray_float_to_datetime_dtype(data=histogram[1])
882 else:
883 histogram = np.histogram(a=metric_values, bins=NUM_HISTOGRAM_BINS)
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:1699, in convert_ndarray_float_to_datetime_dtype(data)
1693 """
1694 Convert all elements of 1-D "np.ndarray" argument from "float" type to "datetime.datetime" type objects.
1695
1696 Note: Converts to "naive" "datetime.datetime" values (assumes "UTC" TimeZone based floating point timestamps).
1697 """
1698 value: Any
-> 1699 return np.asarray([datetime.datetime.utcfromtimestamp(value) for value in data])
File ~\AppData\Local\pypoetry\Cache\virtualenvs\gx-eva-8BZTx4pS-py3.8\lib\site-packages\great_expectations\util.py:1699, in <listcomp>(.0)
1693 """
1694 Convert all elements of 1-D "np.ndarray" argument from "float" type to "datetime.datetime" type objects.
1695
1696 Note: Converts to "naive" "datetime.datetime" values (assumes "UTC" TimeZone based floating point timestamps).
1697 """
1698 value: Any
-> 1699 return np.asarray([datetime.datetime.utcfromtimestamp(value) for value in data])
OSError: [Errno 22] Invalid argument