kubeflow / metadata

Repository for assets related to Metadata.
Apache License 2.0

mysql_query failed: errno: 2006, error: MySQL server has gone away #198

Closed andrewm4894 closed 4 years ago

andrewm4894 commented 4 years ago

/kind bug

What steps did you take and what happened:

I created a Kubeflow deployment via https://deploy.kubeflow.cloud/. I was trying to run through this demo but ran into what looks like a MySQL error when I tried to create a workspace (cell 4 of the demo notebook).
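For reference, the failing cell roughly does the following (a sketch based on the demo notebook; the gRPC host and port are assumed in-cluster defaults and may differ depending on your deployment and SDK version):

from kubeflow.metadata import metadata

# Connect to the Metadata gRPC service in the kubeflow namespace
# (host and port are assumptions here; adjust for your cluster).
store = metadata.Store(grpc_host="metadata-grpc-service.kubeflow", grpc_port=8080)

# Cell 4 of the demo: create (or reuse) a workspace to group artifacts.
# This is the call that fails with "MySQL server has gone away" below.
workspace = metadata.Workspace(
    store=store,
    name="workspace_1",
    description="a workspace for testing",
    labels={"n1": "v1"})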

---------------------------------------------------------------------------
_Rendezvous                               Traceback (most recent call last)
<ipython-input-6-4f220df654e0> in <module>
      4     name="workspace_1",
      5     description="a workspace for testing",
----> 6     labels={"n1": "v1"})
~/.local/lib/python3.6/site-packages/kubeflow/metadata/metadata.py in __init__(self, store, name, description, labels, reuse_workspace_if_exists, backend_url_prefix)
    130     self.description = description
    131     self.labels = labels
--> 132     self.context_id = self._get_context_id(reuse_workspace_if_exists)
    133 
    134   def list(self, artifact_type_name: str = None) -> List[Artifact]:
~/.local/lib/python3.6/site-packages/kubeflow/metadata/metadata.py in _get_context_id(self, reuse_workspace_if_exists)
    187 
    188   def _get_context_id(self, reuse_workspace_if_exists):
--> 189     ctx = self._get_existing_context()
    190     if ctx is not None:
    191       if reuse_workspace_if_exists:
~/.local/lib/python3.6/site-packages/kubeflow/metadata/metadata.py in _get_existing_context(self)
    219   def _get_existing_context(self):
    220     contexts = _retry(
--> 221         lambda: self.store.get_contexts_by_type(self.CONTEXT_TYPE_NAME))
    222     for ctx in contexts:
    223       if ctx.name == self.name:
~/.local/lib/python3.6/site-packages/retrying.py in wrapped_f(*args, **kw)
     47             @six.wraps(f)
     48             def wrapped_f(*args, **kw):
---> 49                 return Retrying(*dargs, **dkw).call(f, *args, **kw)
     50 
     51             return wrapped_f
~/.local/lib/python3.6/site-packages/retrying.py in call(self, fn, *args, **kwargs)
    210                 if not self._wrap_exception and attempt.has_exception:
    211                     # get() on an attempt with an exception should cause it to be raised, but raise just in case
--> 212                     raise attempt.get()
    213                 else:
    214                     raise RetryError(attempt)
~/.local/lib/python3.6/site-packages/retrying.py in get(self, wrap_exception)
    245                 raise RetryError(self)
    246             else:
--> 247                 six.reraise(self.value[0], self.value[1], self.value[2])
    248         else:
    249             return self.value
/usr/lib/python3/dist-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None
~/.local/lib/python3.6/site-packages/retrying.py in call(self, fn, *args, **kwargs)
    198         while True:
    199             try:
--> 200                 attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    201             except:
    202                 tb = sys.exc_info()
~/.local/lib/python3.6/site-packages/kubeflow/metadata/metadata.py in _retry(f)
    755 def _retry(f):
    756   '''retry function f with exponential backoff'''
--> 757   return f()
    758 
    759 
~/.local/lib/python3.6/site-packages/kubeflow/metadata/metadata.py in <lambda>()
    219   def _get_existing_context(self):
    220     contexts = _retry(
--> 221         lambda: self.store.get_contexts_by_type(self.CONTEXT_TYPE_NAME))
    222     for ctx in contexts:
    223       if ctx.name == self.name:
~/.local/lib/python3.6/site-packages/ml_metadata/metadata_store/metadata_store.py in get_contexts_by_type(self, type_name)
    762     response = metadata_store_service_pb2.GetContextsByTypeResponse()
    763 
--> 764     self._call('GetContextsByType', request, response)
    765     result = []
    766     for x in response.contexts:
~/.local/lib/python3.6/site-packages/ml_metadata/metadata_store/metadata_store.py in _call(self, method_name, request, response)
    121     else:
    122       grpc_method = getattr(self._metadata_store_stub, method_name)
--> 123       response.CopyFrom(grpc_method(request))
    124 
    125   def _swig_call(self, method, request, response) -> None:
/usr/local/lib/python3.6/dist-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
    563         state, call, = self._blocking(request, timeout, metadata, credentials,
    564                                       wait_for_ready, compression)
--> 565         return _end_unary_response_blocking(state, call, False, None)
    566 
    567     def with_call(self,
/usr/local/lib/python3.6/dist-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
    465             return state.response
    466     else:
--> 467         raise _Rendezvous(state, None, None, deadline)
    468 
    469 
_Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.INTERNAL
    details = "mysql_query failed: errno: 2006, error: MySQL server has gone away"
    debug_error_string = "{"created":"@1576688509.129253021","description":"Error received from peer ipv4:10.23.253.55:8080","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"mysql_query failed: errno: 2006, error: MySQL server has gone away","grpc_status":13}"

What did you expect to happen:

A workspace would be created that I could store artifacts in.

Anything else you would like to add:

I also see what looks like the same error on the UI:

[screenshot of the same error shown in the Metadata UI]

Environment:

rafbarr commented 4 years ago

I'm having the exact same issue, reproduced by running the metadata demo. I was able to "fix" it by just restarting metadata-grpc-deployment, so it doesn't seem to be a problem with the database itself. It looks like some session is getting stale.

andrewm4894 commented 4 years ago

@rafaelbarreto87 if you get a moment, can you share the kubectl commands you used to do this? I'm new to Kubernetes, so I'm not exactly sure yet how to do things like this.

andrewm4894 commented 4 years ago

P.S. Here is a response I got on the Kubeflow Slack; it seems this is actually an expected error. I'm not exactly sure of the details yet and need to read up a bit on the links below.

https://github.com/kubeflow/pipelines/issues/2329#issuecomment-549590635: Support for the metadata DB wasn't included in Kubeflow 0.7, so the errors you see are expected.

zhenghuiwang commented 4 years ago

The cause is that the connection between metadata-grpc-deployment and the underlying MySQL instance is reset after a certain amount of time.

The temporary fix, as @rafaelbarreto87 suggested, is to restart metadata-grpc-deployment so that it reconnects.

The long-term fix is to add a probe endpoint that tests the liveness of the gRPC deployment so that it is restarted automatically.
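In the meantime, one way to detect the stale state from outside the pod is to issue any lightweight read through the gRPC service, since that exercises the MySQL connection. This is a hypothetical sketch (the service host/port and the idea of running it as an external health check are assumptions, not the probe that was eventually added):

from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Hypothetical health check against the in-cluster Metadata gRPC service.
config = metadata_store_pb2.MetadataStoreClientConfig()
config.host = "metadata-grpc-service.kubeflow"  # assumed service name
config.port = 8080
store = metadata_store.MetadataStore(config)

try:
    # Any read that reaches MySQL will do; it fails with
    # "mysql_query failed: errno: 2006" once the backend connection is stale.
    store.get_artifacts()
    print("metadata gRPC backend looks healthy")
except Exception as err:
    print("metadata gRPC backend looks unhealthy:", err)
    raise SystemExit(1)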

andrewm4894 commented 4 years ago

@zhenghuiwang would you be able to point me towards any instructions on how to restart the grpc-deployment?

Apologies, as I am new to Kubernetes.

haghabozorgi commented 4 years ago

@andrewm4894 if you are on kubectl 1.15+, I believe you can use the rollout restart command: kubectl rollout restart deployment metadata-grpc-deployment -n kubeflow

We are on 1.14 and have just deleted the pod; the deployment recreates it, which also seems to fix the issue: kubectl delete po -l=component=grpc-server -n kubeflow

xigang commented 4 years ago

Why do I need to restart the metadata-grpc-deployment service to get it working again? Is there no other way, such as setting a liveness probe?

discordianfish commented 4 years ago

Ran into this as well. It looks like it successfully connected, but the readiness probe fails. Since it's not the health check, nothing is retrying this operation either. The easiest fix would be using the same readiness probe for the health check, but I'd like to understand why it gets into this 'gone away' state in the first place.

zhenghuiwang commented 4 years ago

Thanks @discordianfish for adding a liveness probe for the HTTP server.

For the gRPC deployment, the issue is fixed upstream in MLMD v0.21 (commit), which is included in the Kubeflow v1.0 RCs.

So this is an issue for KF v0.7 but not for the v1.0 RCs.

discordianfish commented 4 years ago

Ah great, thanks for explaining.

rafbarr commented 4 years ago

Just FYI, MLMD has been updated in master, not in the v1.0-branch. So, if you're using KfDefs from the v1.0-branch, this is still a problem for now.

zhenghuiwang commented 4 years ago

@rafaelbarreto87 Good catch! The above cherry-pick should fix it for KF v1.0.

jlewi commented 4 years ago

Reopening because of kubeflow/kubeflow#4797