kubeflow / examples

A repository to host extended examples and tutorials
Apache License 2.0

KF v1.6.0-rc.1 - MNIST E2E on Kubeflow on Vanilla k8s - TypeError: write() argument must be str, not <class 'bytes'> #993

Open julioo opened 1 year ago

julioo commented 1 year ago

Hello,

Testing KF v1.6.0-rc.1 Mnist E2E on vanilla K8s, I get an error executing tfjob_client.wait_for_job.

    from kubeflow.tfjob import TFJobClient
    tfjob_client = TFJobClient()
    tfjob_client.wait_for_job(train_name, namespace=namespace, watch=True)

Using KF v1.5 this call was successful, but using v1.6.0-rc.1 an exception is raised.

TypeError: write() argument must be str, not <class 'bytes'>

If I replace iostream.py with the previous version of the file, the execution is successful. The file is /opt/conda/lib/python3.8/site-packages/ipykernel/iostream.py

Comparing the previous version with the latest version, the following code seems to cause the issue.

    if not isinstance(string, str):
        raise TypeError(f"write() argument must be str, not {type(string)}")

I don't know how to fix the issue but my current workaround is to replace the iostream.py file with the previous version.
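A less invasive alternative to replacing iostream.py might be to wrap the output stream so that bytes are decoded before they reach a str-only write(). The following is only a minimal sketch under that assumption: the class name BytesTolerantStream is hypothetical, it is demonstrated against io.StringIO rather than the ipykernel stream, and it has not been tested with tfjob_client.wait_for_job.

```python
import io


class BytesTolerantStream:
    """Hypothetical wrapper: decodes bytes before writing to a text stream."""

    def __init__(self, wrapped, encoding="utf-8"):
        self._wrapped = wrapped
        self._encoding = encoding

    def write(self, data):
        # Text streams that only accept str would reject bytes,
        # so decode bytes-like input first and pass str through unchanged.
        if isinstance(data, (bytes, bytearray)):
            data = bytes(data).decode(self._encoding)
        return self._wrapped.write(data)

    def flush(self):
        self._wrapped.flush()


# Demonstration against an in-memory text stream.
buffer = io.StringIO()
stream = BytesTolerantStream(buffer)
stream.write("mnist-train Created".encode("utf-8"))
stream.write(b"\n")
stream.flush()
captured = buffer.getvalue()
```

How such a wrapper would be handed to the code that prints the job status table is not shown in this thread, so that part is left open here.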

Thank you

jbottum commented 1 year ago

@kubeflow/wg-training-leads any input on this?

johnugeorge commented 1 year ago

There are no changes in the SDK with respect to dependencies. By the way, ipykernel is not a dependency of training-sdk: https://pypi.org/project/kubeflow-training/

This is also tested in CI: https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L38

julioo commented 1 year ago

@johnugeorge

> There are no changes in SDK with respect to dependencies. btw, ipykernel is not a dependency of training-sdk . https://pypi.org/project/kubeflow-training/
>
> And this is tested in CI as well https://github.com/kubeflow/training-operator/blob/master/.github/workflows/integration-tests.yaml#L38

I reproduced the same situation with JupyterLab version 3.4.3, using the jupyter-tensorflow-full:v1.6.0-rc.1 image.

Executing

    from kubeflow.tfjob import TFJobClient
    tfjob_client = TFJobClient()
    tfjob_client.wait_for_job(train_name, namespace=namespace, watch=True)

I get the following error:

TypeError                                 Traceback (most recent call last)
Input In [18], in <cell line: 3>()
      1 from kubeflow.tfjob import TFJobClient
      2 tfjob_client = TFJobClient()
----> 3 tfjob_client.wait_for_job(train_name, namespace=namespace, watch=True)

File ~/git_tf-operator/sdk/python/kubeflow/tfjob/api/tf_job_client.py:220, in TFJobClient.wait_for_job(self, name, namespace, timeout_seconds, polling_interval, watch, status_callback)
    217   namespace = utils.get_default_target_namespace()
    219 if watch:
--> 220   tfjob_watch(
    221     name=name,
    222     namespace=namespace,
    223     timeout_seconds=timeout_seconds)
    224 else:
    225   return self.wait_for_condition(
    226     name,
    227     ["Succeeded", "Failed"],
   (...)
    230     polling_interval=polling_interval,
    231     status_callback=status_callback)

File ~/.local/lib/python3.8/site-packages/retrying.py:49, in retry.<locals>.wrap.<locals>.wrapped_f(*args, **kw)
     47 @six.wraps(f)
     48 def wrapped_f(*args, **kw):
---> 49     return Retrying(*dargs, **dkw).call(f, *args, **kw)

File ~/.local/lib/python3.8/site-packages/retrying.py:212, in Retrying.call(self, fn, *args, **kwargs)
    209 if self.stop(attempt_number, delay_since_first_attempt_ms):
    210     if not self._wrap_exception and attempt.has_exception:
    211         # get() on an attempt with an exception should cause it to be raised, but raise just in case
--> 212         raise attempt.get()
    213     else:
    214         raise RetryError(attempt)

File ~/.local/lib/python3.8/site-packages/retrying.py:247, in Attempt.get(self, wrap_exception)
    245         raise RetryError(self)
    246     else:
--> 247         six.reraise(self.value[0], self.value[1], self.value[2])
    248 else:
    249     return self.value

File /opt/conda/lib/python3.8/site-packages/six.py:703, in reraise(tp, value, tb)
    701     if value.__traceback__ is not tb:
    702         raise value.with_traceback(tb)
--> 703     raise value
    704 finally:
    705     value = None

File ~/.local/lib/python3.8/site-packages/retrying.py:200, in Retrying.call(self, fn, *args, **kwargs)
    198 while True:
    199     try:
--> 200         attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
    201     except:
    202         tb = sys.exc_info()

File ~/git_tf-operator/sdk/python/kubeflow/tfjob/api/tf_job_watch.py:55, in watch(name, namespace, timeout_seconds)
     52 status = last_condition.get('type', '')
     53 update_time = last_condition.get('lastTransitionTime', '')
---> 55 tbl(tfjob_name, status, update_time)
     57 if name == tfjob_name:
     58   if status == 'Succeeded' or status == 'Failed':

File /opt/conda/lib/python3.8/site-packages/table_logger/table_logger.py:204, in TableLogger.__call__(self, *args)
    200     raise ValueError('Expected number of columns is {}. Got {}.'.format(
    201         len(self.formatters), len(row_cells)))
    203 line = self.format_row(*row_cells)
--> 204 self.print_line(line)

File /opt/conda/lib/python3.8/site-packages/table_logger/table_logger.py:308, in TableLogger.print_line(self, text)
    307 def print_line(self, text):
--> 308     self.file.write(text.encode(self.encoding))
    309     self.file.write(b'\n')
    310     self.file.flush()

File /opt/conda/lib/python3.8/site-packages/ipykernel/iostream.py:529, in OutStream.write(self, string)
    519 """Write to current stream after encoding if necessary
    520 
    521 Returns
   (...)
    525 
    526 """
    528 if not isinstance(string, str):
--> 529     raise TypeError(f"write() argument must be str, not {type(string)}")
    531 if self.echo is not None:
    532     try:

TypeError: write() argument must be str, not <class 'bytes'>
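The mechanism visible in the traceback can be reproduced outside Jupyter: print_line encodes the row to bytes and writes it to a stream whose write() only accepts str. The sketch below is an illustration only. StrictTextStream is a hypothetical stand-in that copies the isinstance check quoted earlier in this issue, and the print_line function mirrors the two write calls shown in the traceback; neither is the actual ipykernel or table_logger code.

```python
import io


class StrictTextStream(io.StringIO):
    """Hypothetical stand-in for a stream whose write() accepts only str."""

    def write(self, string):
        # Same check as the snippet quoted from iostream.py in this issue.
        if not isinstance(string, str):
            raise TypeError(f"write() argument must be str, not {type(string)}")
        return super().write(string)


def print_line(file, text, encoding="utf-8"):
    # Mirrors the calls in the traceback: the text is encoded to bytes first.
    file.write(text.encode(encoding))
    file.write(b"\n")


stream = StrictTextStream()
try:
    print_line(stream, "mnist-train-05e7  Created")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
# prints: write() argument must be str, not <class 'bytes'>
```

This suggests that the failure comes from bytes being written to a str-only stream, rather than from the TFJob itself.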

After replacing the iostream.py file with the previous version, I get the proper result:

from kubeflow.tfjob import TFJobClient
tfjob_client = TFJobClient()
tfjob_client.wait_for_job(train_name, namespace=namespace, watch=True)
mnist-train-05e7               Created              2022-08-22T14:52:59Z          
mnist-train-05e7               Running              2022-08-22T14:53:08Z          
mnist-train-05e7               Running              2022-08-22T14:53:08Z          
mnist-train-05e7               Succeeded            2022-08-22T14:53:30Z  

Note that this error isn't blocking: the example is still deployed and served successfully.

jacklu2016 commented 1 year ago

Hi Julioo, I am new to Kubeflow and a little confused by this MNIST E2E on Kubeflow on Vanilla k8s example. Please help.

First, should we run the jupyter-tensorflow-full:v1.6.0-rc.1 image on a k8s cluster that has Kubeflow installed, or can we run jupyter-tensorflow-full:v1.6.0-rc.1 anywhere in a Docker runtime?

Second, I found that the notebook first imports the kubernetes client, but I didn't find a pip install of kubernetes anywhere. And don't we need to explicitly configure how to connect to our k8s cluster?