tensorflow / rust

Rust language bindings for TensorFlow
Apache License 2.0
5.16k stars 422 forks source link

Contributing TFRecordWriter and TFExampleParser #162

Closed masonk closed 3 years ago

masonk commented 6 years ago

In my own project, I am using Rust to write TFRecords full of TFExamples to disk.

The code for doing this is something that might be useful to many projects that import the tensorflow crate, and I'd like to submit a PR for these utils.

However, it would be nice to have an agreement from package maintainers on the specifics of the design before I code up the CL. To give an example of outstanding questions:

  1. Do you even want RecordWriter and/or ExampleWriter in this crate?
  2. Namespace? By analogy to the Python API, what about tensorflow::io?
  3. File structure? src/io.rs?
adamcrume commented 6 years ago

I like the idea, and I appreciate asking for direction before submitting a pull request. I have a few questions; please don't be scared off. :)

In response to your questions:

  1. In a general sense, yes. I think if something shows up in the standard TensorFlow API, it probably makes sense to have it in this crate. We use the C API as much as possible, though, and this isn't currently exposed through that API, so I assume this would be pure Rust? If so, it would have to come with the caveat that if this functionality is exposed through the C API later, we may switch to use that in order to ensure consistent behavior with other language bindings.
  2. That sounds fine. We do need to start organizing our namespace, which is a mess at the moment.
  3. src/io.rs sounds good to me.

Would this introduce additional dependencies? If this requires a protobuf implementation, we would want to be careful to choose the right crate. Likewise if there are any file format dependencies.

Does this include something like an Example struct? What would that look like? We currently can't express something like Map<String, Tensor> because Tensor is generic in the element type, so you couldn't have a map containing both int and float tensors, for example. We could expose something similar to the internal AnyTensor trait, and then store a Map<String, Box<AnyTensor>>.

@jhseu Do you know if record/example IO will ever be provided via the C API?

masonk commented 6 years ago

Yes, RecordWriter is pure Rust. Its deps are crc32c and byteorder. I have no qualms with replacing the code with externs to C, of course. I'm pretty sure the binary format of a TFRecord is stable at this point, so I think the best reason to do it would be perf.

As for creating Examples, mine are serialized using the protobuf crate. I haven't spent any time wrapping the raw protobuf gencode API . I will put some put thought into an Example API, as there are a few layers of boilerplate which kind of clutter things. And, yes, this way would entail a protobuf dep. I think it would make sense to make this an optional dep.

Regardless of whether we wrap the gencode API, I think that example.proto and feature.proto are so stable that our best option is to check in the gencode to avoid requiring a build step (and, by the way, a dependency on protoc). We'd only have to update the repo with new gencode whenever the main repository updates those protobufs which, I wager, will happen rarely. There's a lot of data sitting on disks that is in this format, so I wouldn't be surprised if it never changes.

jhseu commented 6 years ago

We won't have a C API for RecordWriter. The (uncompressed) implementation is simple enough that there isn't really a need. We already have a separate Java implementation in the ecosystem repository.

I would suggest that you design it so that it's orthogonal to tf.Examples. From a TensorFlow perspective, we support reading any arbitrary strings out of RecordWriter, so the interface should just be about reading and writing bytes.

adamcrume commented 6 years ago

That sounds good.

@masonk Would you might sending an initial pull request just for the RecordWriter with support for writing strings, leaving out Example support? Once that's sorted we can hash out the details for Examples.

masonk commented 6 years ago

Sounds good. I think I'll get to it this weekend.

masonk commented 6 years ago

Would you like ExampleWriter, too?

adamcrume commented 6 years ago

Yes, but I'd like a little more detail first so we can make sure we're on the same page. Which crate(s) would you add? What exactly would the public API be?

masonk commented 6 years ago

I haven't thought it through yet, but before I do any coding, I'll write up a small design doc and post it --- here? Or another issue?

On Sun, May 20, 2018 at 7:54 PM, Adam Crume notifications@github.com wrote:

Yes, but I'd like a little more detail first so we can make sure we're on the same page. Which crate(s) would you add? What exactly would the public API be?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tensorflow/rust/issues/162#issuecomment-390539923, or mute the thread https://github.com/notifications/unsubscribe-auth/AAbdznHhQ0APcyCLuYFtT1QpgeXmyW9-ks5t0iyCgaJpZM4T8pd4 .

adamcrume commented 6 years ago

Sounds good. Here's fine.

liufuyang commented 4 years ago

One question, do we have APIs in Rust to read TFRecords? 🤔

masonk commented 4 years ago

I never submitted a record reader, and I don't see one in the repo right now.

It would be good to have one, though.

masonk commented 4 years ago

I revisited this today to remember why I never submitted a reader. Here's what I remember:

masonk commented 4 years ago

Pull request for a basic reader

https://github.com/tensorflow/rust/pull/240

liufuyang commented 4 years ago

Interesting, thanks for the fast response, that is a lot of info to take in.

Perhaps just another question so far. The first case I am using Rust to read TFRecord is probably on the serving side. As we will have some features ready in places like BigTable and when providing a user id on the backend we will get a single feature row for that user, in TRRecord formmat.

I am thinking about using Rust to build a backend to host the model and take in the TFRecord (one row) and return a model prediction to the caller.

I think with your PR #240 this can also be achieved? :)

masonk commented 4 years ago

Yes, it should be plenty for that.