terraform-google-modules / terraform-google-slo

Creates SLOs on Google Cloud from custom Stackdriver metrics. Capability to export SLOs to Google Cloud services and other systems.
https://registry.terraform.io/modules/terraform-google-modules/slo/google
Apache License 2.0

A Python error occurs on the slo-pipeline Cloud Function #14

Closed bkamin29 closed 4 years ago

bkamin29 commented 4 years ago

Hello,

We regularly get errors on the slo-pipeline Cloud Function.

For example, on February 25, 2020, between 7:45 p.m. and 8 p.m., there were no successful executions of the slo-pipeline Cloud Function:

image

When I check the logs:

image

In the logs I regularly see this type of error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 25, in main
    compute.export(data, exporters)
  File "/env/local/lib/python3.7/site-packages/slo_generator/compute.py", line 94, in export
    ret = exporter.export(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 47, in export
    self.create_timeseries(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 64, in create_timeseries
    data['error_budget_policy_step_name'])
TypeError: 'NoneType' object is not subscriptable

or this other error:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 824, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/env/local/lib/python3.7/site-packages/grpc/_channel.py", line 726, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
  status = StatusCode.INVALID_ARGUMENT
  details = "One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point.: timeSeries[0]"
  debug_error_string = "{"created":"@1582657332.973434028","description":"Error received from peer ipv4:173.194.76.95:443","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point.: timeSeries[0]","grpc_status":3}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 383, in run_background_function
    _function_handler.invoke_user_function(event_object)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 217, in invoke_user_function
    return call_user_function(request_or_event)
  File "/env/local/lib/python3.7/site-packages/google/cloud/functions/worker.py", line 214, in call_user_function
    event_context.Context(**request_or_event.context))
  File "/user_code/main.py", line 25, in main
    compute.export(data, exporters)
  File "/env/local/lib/python3.7/site-packages/slo_generator/compute.py", line 94, in export
    ret = exporter.export(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 47, in export
    self.create_timeseries(data, **config)
  File "/env/local/lib/python3.7/site-packages/slo_generator/exporters/stackdriver.py", line 90, in create_timeseries
    result = self.client.create_time_series(project, [series])
  File "/env/local/lib/python3.7/site-packages/google/cloud/monitoring_v3/gapic/metric_service_client.py", line 1039, in create_time_series
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/env/local/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
    on_error=on_error,
  File "/env/local/lib/python3.7/site-packages/google/api_core/retry.py", line 184, in retry_target
    return target()
  File "/env/local/lib/python3.7/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/env/local/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 59, in error_remapped_callable
    six.raise_from(exceptions.from_grpc_error(exc), exc)
  File "<string>", line 3, in raise_from
google.api_core.exceptions.InvalidArgument: 400 One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older end time than the most recent point.: timeSeries[0]

Top errors: image

Looking at the executions list, we can see many failed executions:

image

I'm not sure whether the absence of successful executions on February 25, 2020 is linked to the regular failed executions. Could it have been an outage of the Cloud Functions service?

image

image

ocervell commented 4 years ago

Hello @benesis002, can you try deploying from the latest master branch and setting slo_generator_version = "1.0.0" in your modules?

The first error happens when an empty report is sent to Pub/Sub: the slo_pipeline Cloud Function cannot export results that are None. This should be fixed in the current slo-generator version.
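
To illustrate the failure mode, here is a minimal sketch of a guard that skips the export when the decoded report is empty. It is not the actual fix shipped in slo-generator; `decode_report`, `export_report` and the `export_fn` parameter are hypothetical names, and the base64/JSON Pub/Sub payload layout is an assumption about how the function receives its data.

```python
import base64
import json


def decode_report(event):
    """Decode a Pub/Sub event payload; return None when no usable report is present."""
    raw = event.get("data")
    if not raw:
        return None
    payload = json.loads(base64.b64decode(raw).decode("utf-8"))
    # A JSON `null` or an empty object both count as "no report".
    return payload or None


def export_report(data, exporters, export_fn):
    """Call export_fn only for a non-empty report; return True if the export ran."""
    if data is None:
        print("Empty SLO report received, skipping export")
        return False
    export_fn(data, exporters)
    return True
```

With a guard like this, an empty message is logged and dropped instead of raising the TypeError seen in the first traceback.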

The second error seems transient. There should be an automatic retry embedded in the client, so it would be good to check whether the error_budget_burn_rate timeseries has 'holes' in it at the timestamp of the error. I think it's due to writing the timeseries simultaneously for all the steps in the error budget.
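
One way to look for such holes, once the point end times of the timeseries have been fetched, is to compare consecutive timestamps against the expected write interval. The sketch below is only an illustration: `find_holes`, its parameters and the 50% tolerance are assumptions, and it works on plain epoch seconds rather than calling the Monitoring API.

```python
def find_holes(end_times, expected_interval, tolerance=0.5):
    """Return (previous, next) end-time pairs that are further apart than expected.

    end_times: point end times in epoch seconds, in any order.
    expected_interval: nominal number of seconds between two writes.
    tolerance: extra fraction of the interval allowed before a gap counts as a hole.
    """
    ordered = sorted(end_times)
    max_gap = expected_interval * (1 + tolerance)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def error_in_hole(holes, error_time):
    """Return True if error_time falls strictly inside one of the holes."""
    return any(start < error_time < end for start, end in holes)
```

If the error timestamp falls inside a hole, the failed write was not recovered by a retry; if there is no hole, the point was eventually written and the error was indeed transient.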

ocervell commented 4 years ago

@benesis002 did you have time to test this? I have re-deployed the slo-generator on the newest version and can't reproduce the 400 error. I do see some 500 errors, which I have reported internally since I have no debug logs for them.

ocervell commented 4 years ago

Fixed the 400 error by upgrading to slo_generator_version = "1.0.0".