jlewi / flaap

Federated Learning and Analytics Protocols
Apache License 2.0

coordinator RPC is failing to taskstore #16

Closed by jlewi 2 years ago

jlewi commented 2 years ago

Here's the error:

INFO:absl:Received retryable gRPC error: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Failed to marshal data to serialize: proto: google.protobuf.Any: unable to resolve "type.googleapis.com/tensorflow.GraphDef": not found"
        debug_error_string = "{"created":"@1664235008.153400140","description":"Error received from peer ipv4:127.0.0.1:8081","file":"src/core/lib/surface/call.cc","file_line":952,"grpc_message":"Failed to marshal data to serialize: proto:\u00c2\u00a0google.protobuf.Any: unable to resolve "type.googleapis.com/tensorflow.GraphDef": not found","grpc_status":2}"
>

This serialization error is happening inside the taskstore server. The problem appears to be that we don't have Go generated code for the TensorFlow protocol buffers, so the server cannot resolve the type URL of a google.protobuf.Any that wraps a tensorflow.GraphDef.
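
A toy model of why this fails, written in plain Python rather than against the real protobuf runtime: an `Any` carries only a type URL plus the encoded bytes, so expanding it into a readable form needs a lookup of that type URL, while passing the bytes through does not. `AnyLike`, `REGISTRY`, `to_json`, `passthrough`, and the `example.Note` type are hypothetical names for illustration, not protobuf or taskstore APIs.

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in for the message types compiled into the server.
# tensorflow.GraphDef is deliberately absent, as in the taskstore binary.
REGISTRY = {
    "type.googleapis.com/example.Note": lambda raw: raw.decode("utf-8"),
}


@dataclass
class AnyLike:
    """Toy version of google.protobuf.Any: a type URL plus encoded bytes."""
    type_url: str
    value: bytes


def to_json(msg: AnyLike) -> str:
    """Expanding the payload requires a decoder registered for the type URL."""
    decoder = REGISTRY.get(msg.type_url)
    if decoder is None:
        raise LookupError(f'unable to resolve "{msg.type_url}": not found')
    return json.dumps({"@type": msg.type_url, "value": decoder(msg.value)})


def passthrough(msg: AnyLike) -> bytes:
    """Forwarding the raw bytes never consults the registry."""
    return msg.value


known = AnyLike("type.googleapis.com/example.Note", b"hello")
unknown = AnyLike("type.googleapis.com/tensorflow.GraphDef", b"\x0a\x00")

print(to_json(known))        # works: the type is registered
print(passthrough(unknown))  # works: no lookup needed
try:
    to_json(unknown)         # fails: no decoder for GraphDef
except LookupError as err:
    print(f"failed: {err}")
```

In this model the failure only shows up when something tries to expand the payload, which matches the error being raised during marshaling in the server rather than in the client.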

jlewi commented 2 years ago

I think there are two possible solutions:

  1. Generate Go libraries for the TensorFlow protocol buffers
  2. Change Task.proto to treat the request/response payloads as an opaque set of bytes

Unfortunately, it doesn't look like there are any precompiled Go versions of the protocol buffers: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/go

Given that the TF protos span multiple files, I don't think we want to try building them without Bazel.

Treating the payloads as opaque bytes might be easier, since it avoids the complexity of checking in generated Go proto libraries and keeping them in sync. Having to bazel build the protos and then copy the generated Go files would be cumbersome to maintain.
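
A rough sketch of the opaque-bytes option, in plain Python for brevity even though the taskstore server itself is not written in Python. `Task`, `TaskStore`, and the field names are hypothetical and are not the actual Task.proto schema. The point is only that the store saves and returns the payload without decoding it, so it needs no TensorFlow proto definitions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; payloads are bytes the store never parses."""
    name: str
    request: bytes = b""
    response: bytes = b""


class TaskStore:
    """In-memory stand-in for the taskstore server."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def create(self, task: Task) -> None:
        # Stored as-is; no type lookup or decoding happens here.
        self._tasks[task.name] = task

    def get(self, name: str) -> Task:
        return self._tasks[name]


# The caller serializes its own message. With a real TensorFlow proto this
# would be something like graph_def.SerializeToString().
payload = b"\x0a\x03abc"  # placeholder bytes, not a real GraphDef

store = TaskStore()
store.create(Task(name="task-1", request=payload))

# Only the caller, which has the TensorFlow protos, decodes the bytes again.
assert store.get("task-1").request == payload
```

The trade-off is that the store can no longer inspect or validate the payloads; producers and consumers have to agree on the encoding out of band.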

jlewi commented 2 years ago

The change is on the branch jlewi/list but has not been merged to master yet.