jlewi / flaap

Federated Learning and Analytics Protocols
Apache License 2.0

worker task updates are failing because task isn't assigned to that worker #18

Open jlewi opened 2 years ago

jlewi commented 2 years ago

Here's the error in the worker logs.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/storage/jupyter/git_flaap/py/flaap/tff/task_handler.py", line 173, in <module>
    fire.Fire(Runner)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 466, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/storage/jupyter/git_flaap/py/flaap/tff/task_handler.py", line 163, in run
    asyncio.run(handler.run())
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/storage/jupyter/git_flaap/py/flaap/tff/task_handler.py", line 115, in run
    await self._poll_and_handle_task()
  File "/storage/jupyter/git_flaap/py/flaap/tff/task_handler.py", line 108, in _poll_and_handle_task
    response = _run_rpc(self._tasks_stub.Update, update_request)
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 324, in wrapped_f
    return self(f, *args, **kw)
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
    do = self.iter(retry_state=retry_state)
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 349, in iter
    return fut.result()
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 407, in __call__
    result = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/tensorflow_federated/python/common_libs/tracing.py", line 228, in sync_trace
    result = fn(*fn_args, **fn_kwargs)
  File "/storage/jupyter/git_flaap/py/flaap/tff/task_handler.py", line 147, in _run_rpc
    return rpc_func(request)
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/lib/python3.10/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.FAILED_PRECONDITION
        details = "Worker  can't update task 1937980ad16b4070bb7e3ab8e26ca43c; this task has not been assigned to that worker"
        debug_error_string = "{"created":"@1664334549.643516205","description":"Error received from peer ipv4:127.0.0.1:8081","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Worker  can't update task 1937980ad16b4070bb7e3ab8e26ca43c; this task has not been assigned to that worker","grpc_status":9}"
>

There's no error in the taskstore logs.

The taskstore logs show:

{"level":"info","ts":1664334549.5726326,"caller":"tasks/file.go:236","msg":"Assigning worker to group","workerId":"d09846598ce24e64915848c0c09e7a69","group":"e73c621fadfb42c2ae21c39cbee0fdbd"}

We should probably update the server to print out the error.

The worker logs don't print out the worker's ID.
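A minimal sketch of logging the worker ID at startup, so worker and taskstore logs can be correlated. The `Worker` class and `new_worker_id` helper are hypothetical and not taken from task_handler.py; the ID format (32 hex characters) is only assumed from the `workerId` value in the taskstore log above.

```python
import logging
import uuid
from typing import Optional

logger = logging.getLogger(__name__)


def new_worker_id() -> str:
    """Return a random 32-character hex ID (assumed format, see the taskstore log)."""
    return uuid.uuid4().hex


class Worker:
    """Hypothetical stand-in for the real task handler."""

    def __init__(self, worker_id: Optional[str] = None):
        # Fall back to a generated ID so the worker never runs with an empty one.
        self.worker_id = worker_id or new_worker_id()
        # Log the ID once at startup so every later log line can be tied to it.
        logger.info("Starting worker; workerId=%s", self.worker_id)
```

Logging the ID once at startup keeps the change small; including it in every log record would be an alternative.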

jlewi commented 2 years ago

The error message:

"{"created":"@1664334549.643516205","description":"Error received from peer ipv4:127.0.0.1:8081","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Worker  can't update task 1937980ad16b4070bb7e3ab8e26ca43c; this task has not been assigned to that worker","grpc_status":9}"

indicates that the server thinks the worker_id is the empty string: there are two spaces between "Worker" and "can't", where the ID should appear.
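As a rough illustration of why the message points to an empty worker_id: if the server interpolates the ID into a template like the one below, an empty string reproduces the observed text exactly. The template is an assumption inferred from the error message; the real server is written in Go and its format string may differ.

```python
def format_not_assigned_error(worker_id: str, task_id: str) -> str:
    """Mimic the server's error text (template assumed from the observed message)."""
    return (
        f"Worker {worker_id} can't update task {task_id}; "
        "this task has not been assigned to that worker"
    )


# The message as it appears in the worker's traceback (note the two spaces).
OBSERVED = (
    "Worker  can't update task 1937980ad16b4070bb7e3ab8e26ca43c; "
    "this task has not been assigned to that worker"
)

# An empty worker_id leaves nothing between the two spaces around the placeholder.
message = format_not_assigned_error("", "1937980ad16b4070bb7e3ab8e26ca43c")
```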

jlewi commented 2 years ago

Looks like a bug in the task_handler code: it isn't setting worker_id.
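A sketch of one possible fix, assuming the update request carries a `worker_id` field. `UpdateRequest` here is a hypothetical dataclass standing in for the real protobuf message, and `build_update_request` is an illustrative helper, not existing flaap code. The guard makes the worker fail fast locally instead of getting FAILED_PRECONDITION back from the server.

```python
from dataclasses import dataclass


@dataclass
class UpdateRequest:
    """Hypothetical stand-in for the real protobuf update request."""

    task_id: str
    worker_id: str = ""


def build_update_request(task_id: str, worker_id: str) -> UpdateRequest:
    """Build an update request, refusing to send one without a worker ID."""
    if not worker_id:
        # An empty ID is what the server rejected in the traceback above.
        raise ValueError("worker_id must be set before updating a task")
    return UpdateRequest(task_id=task_id, worker_id=worker_id)
```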