RFC: Unbound SavedModel #453

Closed BlaziusMaximus closed 1 year ago

BlaziusMaximus commented 1 year ago

This RFC will be open for comment until Friday, August 11th, 2023. cc @k-w-w @petrychenko

Unbound SavedModel

Status: Implemented
RFC #: 453
Author(s): Kathy Wu (kathywu@google.com), Adam Cogdell (adamcogdell@google.com)
Sponsor: Ivan Petrychenko (petrychenko@google.com)
Updated: 2023-07-24

Objective

We are proposing a new format for SavedModel and a generic proto-splitting library that resolves the 2GB proto issue. The purpose of this RFC is to publicize the design of this new format, which has been implemented (see tensorflow/tools/proto_splitter), but is open to comments and changes from the open source community.
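For a quick sense of how the format is surfaced to users: in recent TensorFlow releases the chunked format can be enabled through a SaveOptions flag. A minimal sketch (the flag name experimental_image_format reflects TF 2.15 and may change while the feature is experimental):

import tensorflow as tf

# Minimal sketch: save a model using the chunked (proto-splitting) format
# described in this RFC. The experimental_image_format flag name reflects
# TF 2.15 and should be treated as an assumption for other versions.
model = tf.Module()
model.v = tf.Variable([1.0, 2.0, 3.0])

tf.saved_model.save(
    model,
    "/tmp/unbound_savedmodel",
    options=tf.saved_model.SaveOptions(experimental_image_format=True),
)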

othakkar commented 1 year ago

@BlaziusMaximus Thank you for the proposal. Based on my understanding, the proposed implementation solves the issue of exporting a SavedModel that currently fails due to exceeding the 2GB protobuf serialization limit.

Does the proposed solution also address the protobuf limitation encountered when importing a GraphDef, or when freezing a SavedModel (converting variables to constants whose size exceeds 2GB)? (e.g. https://github.com/tensorflow/tensorflow/issues/60571#issuecomment-1546545861) If not, are there any plans for the same? Thanks!

k-w-w commented 1 year ago

@othakkar Thanks for your comment! The GraphDef freezing code is not planned to be updated as part of this project.

However, you may be able to use some of the added infrastructure to address this issue. If I remember the implementation of convert_to_constants_v2 correctly, it goes through Graph.as_graph_def at some point, which fails when the graph is >2GB: that method calls into the C API's Graph-to-GraphDef function, serializes the result to a string, and then deserializes the string in Python. The string serialization is typically what hits the 2GB limit.

@BlaziusMaximus added a new argument to Graph._as_graph_def that uses the pybind11 proto directly and skips the string serialization. The default is currently set to False to be safe (both performance-wise and behavior-wise), but you could try setting it to True to always use pybind11.

  def _as_graph_def(
      self, from_version=None, add_shapes=False, use_pybind11_proto=False):

https://github.com/tensorflow/tensorflow/blob/863451d7d0875aac843e65f485cdf7dcbbd72c0a/tensorflow/python/framework/ops.py#L2300C7-L2301
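For illustration, a caller-side sketch (note that _as_graph_def is a private API, so this is not a supported entry point and may change between TF versions):

import tensorflow as tf

# Sketch: request the GraphDef through the pybind11 path, which returns
# the proto directly instead of round-tripping through a serialized
# string. _as_graph_def is private and returns a (GraphDef, version) pair.
graph = tf.compat.v1.get_default_graph()
graph_def, _version = graph._as_graph_def(use_pybind11_proto=True)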

edit: Actually, convert_to_constants_v2 uses import_graph_def [1], which also does the string serialization. That code path has not been updated to use pybind11; a fix similar to the change in Graph._as_graph_def can be applied there.

[1] https://github.com/tensorflow/tensorflow/blob/v2.13.0/tensorflow/python/framework/importer.py#L507-L508
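Schematically, the path at [1] looks like the snippet below; a pybind11-based fix would pass the proto directly instead of serializing it first. The *Pybind name in the comment is hypothetical and does not exist yet:

from tensorflow.python.client import pywrap_tf_session
from tensorflow.python.framework import c_api_util

# Current path (schematic, based on [1]): the GraphDef is serialized to a
# string and handed to the C API through a TF_Buffer, which is what hits
# the 2GB protobuf limit for very large graphs.
with c_api_util.tf_buffer(graph_def.SerializeToString()) as serialized:
  with graph._c_graph.get() as c_graph:
    results = pywrap_tf_session.TF_GraphImportGraphDefWithResults(
        c_graph, serialized, options)

# Hypothetical pybind11 variant: a binding along the lines of
# TF_GraphImportGraphDefWithResultsPybind (not implemented today) would
# accept graph_def directly and skip the string serialization entirely.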

othakkar commented 1 year ago

@k-w-w thank you for your response! The reference implementation (in as_graph_def(...)) uses pybind11 to convert a Graph into a GraphDef, whereas import_graph_def(...) takes a GraphDef proto as input, serializes it, and imports it into the default graph.

Would you or @BlaziusMaximus be able to elaborate on how to go about updating this code path to use the pybind11 proto API? My goal is to freeze a GraphDef while avoiding the 2GB protobuf limit. Please excuse my lack of understanding of the pybind11 proto API :)

BlaziusMaximus commented 1 year ago

@othakkar Absolutely! I'll use _as_graph_def as an example.

if use_pybind11_proto:
  # Pybind11 path: return the GraphDef proto directly from C++ to Python,
  # avoiding the string round trip (and the 2GB serialization limit).
  with self._c_graph.get() as c_graph:
    graph = graph_pb2.GraphDef()
    graph.CopyFrom(pywrap_tf_session.TF_GraphToGraphDefPybind(c_graph))
else:
  # Legacy path: serialize the GraphDef to a string in C++, copy it into
  # a TF_Buffer, then parse it back into a Python proto. This fails for
  # graphs >2GB.
  with c_api_util.tf_buffer() as buf:
    with self._c_graph.get() as c_graph:
      pywrap_tf_session.TF_GraphToGraphDef(c_graph, buf)
      data = pywrap_tf_session.TF_GetBuffer(buf)
    graph = graph_pb2.GraphDef()
    graph.ParseFromString(compat.as_bytes(data))

Before the pybind11 addition, _as_graph_def called pywrap_tf_session.TF_GraphToGraphDef, which simply passed the serialized proto via a pointer from the pywrap session in C++ to Python. That was fine as long as we could rely on GraphDef serialization, but serialization fails once the proto exceeds 2GB. Pybind11 (with pybind11_protobuf) lets us pass protos directly to Python by returning the heap-allocated proto, as we do here in TF_GraphToGraphDefPybind:

m.def("TF_GraphToGraphDefPybind", [](PyGraph* graph) {
  tensorflow::Safe_TF_StatusPtr status =
      tensorflow::make_safe(TF_NewStatus());
  // Release GIL.
  py::gil_scoped_release release;
  TF_Graph* tf_graph = graph->tf_graph();
  auto def = new tensorflow::GraphDef();
  {
    tensorflow::mutex_lock l(tf_graph->mu);
    tf_graph->graph.ToGraphDef(def);
  }
  tensorflow::MaybeRaiseRegisteredFromTFStatusWithGIL(status.get());
  return def;
});

It's important that the module we're defining this function in is declared as a PYBIND11_MODULE with NativeProtoCasters imported:

PYBIND11_MODULE(_pywrap_tf_session, m) {
  pybind11_protobuf::ImportNativeProtoCasters();
  ...
}
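With the casters imported, the Python side receives an ordinary proto; a sketch of what a caller sees (c_graph obtained as in the _as_graph_def snippet above):

# Sketch: the return value of TF_GraphToGraphDefPybind arrives in Python
# as a regular graph_pb2.GraphDef, with no intermediate string
# serialization or parsing step.
graph_def = pywrap_tf_session.TF_GraphToGraphDefPybind(c_graph)
print(len(graph_def.node))  # ordinary proto field access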

othakkar commented 1 year ago

@BlaziusMaximus thanks for the explanation. I've been exploring how to update the import_graph_def() code path to use pybind11_protobuf, and I could use your help with the following: similar to how pybind11_protobuf allows us to pass protos directly from C++ to Python, is there a way to pass a GraphDef proto from Python to C++ without performing serialization? This would be needed to invoke TF_GraphImportGraphDefWithResults from the pywrap session in C++.

On another note, I'm wondering how to use the chunks and apply some sort of transformation to them. Could you give an end-to-end example where one splits a very large (>2GB) GraphDef in memory, applies a transformation to the chunks, and merges the modified chunks into a new GraphDef proto? I saw this example provided by you, but was confused about how to apply any transformation to the chunks.

BlaziusMaximus commented 1 year ago

@othakkar I'm currently working on an answer to your first question, and have created a separate issue for it (https://github.com/tensorflow/tensorflow/issues/61587).

Unfortunately, I don't have a coded-up example to give for your second question. However, one example of using in-memory chunks is two processes that need to send chunked protos over a network: the sender calls Split() and the receiver calls Merge().

But if you do need to apply a custom transformation, that requires parsing logic similar to what's done in the Merger: you'd use a riegeli::Reader to step through the chunks, using the ChunkMetadata as a guide and making changes along the way. See the in-depth guide for more info, or let me know if you have any specific questions. Glad to see that you're interested in the API!
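To make the shape of such a pipeline concrete, here is a purely illustrative sketch; split_graph_def, transform_chunk, and merge_chunks are hypothetical stand-ins for the Splitter/Merger machinery, not real proto_splitter symbols:

from tensorflow.core.framework import graph_pb2

# Hypothetical sketch only: the three helpers below are made-up names
# standing in for the Splitter/Merger machinery described above; they
# are not real proto_splitter symbols.
def rewrite_large_graph(graph_def: graph_pb2.GraphDef) -> graph_pb2.GraphDef:
  # 1. Split the >2GB GraphDef into chunks plus metadata describing
  #    where each chunk belongs in the parent proto.
  chunks, metadata = split_graph_def(graph_def)

  # 2. Transform chunk by chunk, consulting the metadata to interpret
  #    each chunk (message field, node, large constant, ...).
  new_chunks = [transform_chunk(c, info)
                for c, info in zip(chunks, metadata.chunks)]

  # 3. Merge the modified chunks back into a single GraphDef.
  return merge_chunks(new_chunks, metadata)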