tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Would the team accept a JS version, published on NPM (instead of PyPI)? #60

Open bileschi opened 5 years ago

bileschi commented 5 years ago

Is your feature request related to a problem? Please describe.

Ideally, the datasets API would be available cross-language, like Keras or TensorFlow. Many TF learners come to TensorFlow from JavaScript and would benefit from access to known datasets.

Describe the solution you'd like

in package.json

npm add @tensorflow/tensorflow-datasets 

in index.js

import * as tfds from '@tensorflow/tensorflow-datasets';
const ds = tfds.load('mnist');

Additional context: js.tensorflow.org

rsepassi commented 5 years ago

Absolutely, this would be fantastic! We’d love for TFDS to be cross language.

Looks like TFJS has a tf.data.Dataset API.

I think we’d keep all the data generation in Python, but the input pipeline could be portable. The input pipeline is a TF Graph, so one way of adding cross-language support is to export the TF graph for each dataset’s input pipeline and have various languages implement the tf.data API in their TF front-end.
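
A rough sketch of that first option (an assumed TF 1.x graph-mode workflow, not an existing TFDS feature): build the input pipeline in a graph and export its GraphDef for another front-end's TF runtime to execute.

import tensorflow.compat.v1 as tf
import tensorflow_datasets as tfds

tf.disable_v2_behavior()  # TF 1.x-style graph mode

# Build the input pipeline inside an explicit graph.
graph = tf.Graph()
with graph.as_default():
  ds = tfds.load("mnist", split="train")
  iterator = tf.data.make_one_shot_iterator(ds)
  next_example = iterator.get_next()  # the node another front-end would fetch

# Export the GraphDef; any front-end with a TF runtime could execute it.
tf.io.write_graph(graph.as_graph_def(), "/tmp", "mnist_pipeline.pbtxt")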

Alternatively we could re-implement the elements of TFDS that are responsible for the input pipeline: reading elements from disk, the various FeatureConnector.decode_example methods, etc. This seems like it would be a significant maintenance burden, so we probably won’t have any other supported front-end of this kind anytime soon, but we’d be happy to point to community contributions.

I’ll reach out to others on TF about how well-supported tf.data is in other language front-ends.

What are other ways of approaching this problem?

Thanks for the suggestion!

bileschi commented 5 years ago

Thanks @rsepassi

What sort of pipelining is done before the data is yielded? Is it possible to store the result of the decode_example methods in some language-agnostic format (like Apache Arrow) and then yield those over the wire?

rsepassi commented 5 years ago

It's a simple tf.data pipeline, sketched below:

  1. Read records from TFRecord files
  2. Parse Example protos
  3. Decode each feature that needs special decoding (e.g. decode JPEG images to [h, w, c])
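
In tf.data terms, a minimal sketch of those three steps (the file name and feature spec here are illustrative, not the exact TFDS schema):

import tensorflow as tf

# 1. Read records from TFRecord files.
dataset = tf.data.TFRecordDataset(["mnist-train.tfrecord"])

# 2. Parse Example protos.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
dataset = dataset.map(lambda rec: tf.io.parse_single_example(rec, feature_spec))

# 3. Decode features that need it, e.g. encoded image bytes to [h, w, c].
dataset = dataset.map(
    lambda ex: {"image": tf.io.decode_image(ex["image"]), "label": ex["label"]})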

Arrow is a great idea!

Here's what I think would be necessary to enable nice cross-language support:

  1. Add a TF op to convert a Tensor or dict of Tensors to an Arrow RecordBatch or DictionaryBatch.
  2. Add a TF op to write to an Arrow Stream.

This would enable:

dataset = tfds.load("mnist", split="train")
dataset = dataset.apply(tf.io.arrow.to_dictionary_batch(...options...))
stream = tf.io.arrow.Stream("host:port")
for example in dataset:
  stream.write(example)

Then we can work on performance optimizations for copies/serialization. Arrow itself should be 0-copy, which is nice, and hopefully we can figure out a way to make the TF-to-Arrow op 0-copy too.

These ops can live in tensorflow/io's arrow directory. @yongtang and @BryanCutler added Arrow reading support to tensorflow/io in tensorflow/io#36. What do you two think about the above?

@mrry @jsimsa from the tf.data side: what do you two think about the above?

yongtang commented 5 years ago

@bileschi @rsepassi That would be great! We are working on adding support for more write ops in tensorflow-io for many formats, and Arrow would be an ideal case.

/cc @BryanCutler as he may have more knowledge with Arrow.

BryanCutler commented 5 years ago

@rsepassi I think that sounds like a good idea! You will want to follow the Arrow stream protocol for sending over the wire (see here): basically a schema to describe the data, followed by a sequence of record batches.

The Arrow DictionaryBatch is meant to encode data in a regular RecordBatch for efficiency, e.g. encoding categorical strings as integers. Just sticking with regular RecordBatches is probably best to start with.
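
For reference, the stream protocol is easy to exercise with pyarrow; an illustrative sketch (column names and values are made up), separate from the proposed TF ops:

import pyarrow as pa

batch = pa.RecordBatch.from_arrays(
    [pa.array([5, 0, 4]), pa.array([b"\x01", b"\x02", b"\x03"])],
    names=["label", "image_bytes"])

# The writer emits the schema message first, then each record batch.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
  writer.write_batch(batch)

# A reader on the receiving end reconstructs the batches.
for received in pa.ipc.open_stream(sink.getvalue()):
  print(received.num_rows)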

rsepassi commented 5 years ago

Just to be clear, this solution would not work in a browser; it is for the server environment, since you'd still have to run the Python code. @bileschi is that still useful?

bileschi commented 5 years ago

@rsepassi Thanks for asking. We would really like to be able to access this data from the browser, if possible. In general we will not have a server, so we will not be able to use tfds.datasets for our use case if it requires a Python runtime to work. We can guarantee the TensorFlow runtime, however.

If you'll permit a bit of design in a GitHub issue, here is the way I see it:

There are four important data representations that belong in this discussion:

  A. Raw files, as originally distributed
  B. Example protos stored in TFRecord files
  C. Maps of numerical objects (decoded features)
  D. Feature-engineered tensors, ready for the model

The data may be transferred at any of the above representations, but each representation has its own implications.

The current implementation, if I understand correctly, is that tfds sends data in format A (raw files), and the library includes the instructions and capacity to do the rest. All computation to go from A->B->C->D happens on the client side. This is a problem for the browser, since we aren't sure we have the capacity to include all the required tooling. Even downloading the whole file set at once may be too much to ask, given the constraints on browser storage.

I'd like to suggest we explore the option of streaming data in format C (maps of numerical objects), perhaps transferred over the wire in a channel carrying data in Arrow format. We can also send along the TensorFlow model which will perform the feature engineering, if necessary, so the client can perform C->D by calling data_in_format_d = feature_model.predict(data_in_format_c)
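
A hedged sketch of the client side of that idea (the stream_source transport, the feature model artifact, and the consume helper are all assumptions, and a real browser client would use tfjs rather than Python):

import pyarrow as pa
import tensorflow as tf

def consume(stream_source, feature_model):
  # stream_source stands in for whatever carries the Arrow stream bytes.
  for batch in pa.ipc.open_stream(stream_source):
    # Format C: a map of feature name -> numerical values.
    data_in_format_c = {k: tf.constant(v) for k, v in batch.to_pydict().items()}
    # Format D: apply the shipped feature-engineering model
    # (assumed to take named inputs).
    data_in_format_d = feature_model.predict(data_in_format_c)
    yield data_in_format_d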

Does this make sense? Is it within the realm of possibility? Are there negative implications to other use cases?

rsepassi commented 5 years ago

> I'd like to suggest we explore the option of streaming data in format C (maps of numerical objects), perhaps transferred over the wire in a channel carrying data in Arrow format. We can also send along the TensorFlow model which will perform the feature engineering, if necessary, so the client can perform C->D by calling data_in_format_d = feature_model.predict(data_in_format_c)

So we have a TF Graph of the input reading pipeline. If you have a TF runtime that can run that Graph, then we can just ship the Graph and you can run that.

If not, then I like the idea of streaming data in format C, but where is the data streaming from? I was thinking that the Python process (running TF) would be the one to stream it out, but that requires a Python (and TF) runtime to be available. Were you thinking that the data was already stored in Arrow format somewhere? Unfortunately, for most datasets, licensing prevents us from hosting the preprocessed data (which would be considered redistribution). That's why TFDS ships with the logic to download the source data and preprocess it into a standard format.

bileschi commented 5 years ago

Yeah, I was thinking the data could be stored in the processed format. Funny that we can share the data and the instructions on how to prepare it, and we can even provide Colabs that will execute those instructions for you, but we can't share the prepared data. I'm sure you've been over this much more than I have, but it's still frustrating.

Is there some way I can determine the limited subset of the datasets that can be redistributed? Is there a solution where those could be whitelisted? Alternatively, can we point to a service where the preprocessing will run on the requester's behalf, such that they will not need to perform the A->B->C transform?

Thanks for thinking this through with me.

rsepassi commented 5 years ago

Yes, it is frustrating.

We are working on whitelisting datasets where we do have license to redistribute. For those, we could host the preprocessed format as well. So far, it's not a large set and many of the most popular datasets don't fall into this category (e.g. anything with images pulled from the web without license checks).

I haven't looked into the service approach; how would it work? One idea is that a GCP instance gets spun up on the user's behalf and does the data preprocessing and then can upload/stream it somewhere. That would cost the user $ though.

bileschi commented 5 years ago

Yeah, I'm trying to support the use case where users are exploring datasets and just playing around: the type of exploration that takes place when people are just beginning to learn ML. I think a paywall would heavily winnow such users.

Do we have any leeway at all in the format in which we store the data at rest? Can we make it easier to request arbitrary samples in data format B?

rsepassi commented 5 years ago

Well there's no leeway in redistributing, except for the few datasets where we do have license/permission.

Data format B is what's in the TFRecord files in Example protos.

TFRecords are a very simple file format (spec).

And protos in JS should be easy too (docs).
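
To illustrate how simple the framing is, here is a minimal TFRecord reader (CRC validation omitted for brevity); each payload it yields is a serialized Example proto:

import struct

def read_tfrecords(path):
  with open(path, "rb") as f:
    while True:
      header = f.read(8)       # uint64 little-endian: length of the record
      if not header:
        break
      (length,) = struct.unpack("<Q", header)
      f.read(4)                # masked CRC32 of the length (unchecked here)
      data = f.read(length)    # the serialized Example proto
      f.read(4)                # masked CRC32 of the data (unchecked here)
      yield data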

So for those datasets where we can redistribute the data, I think you can use the preprocessed data as is and provide limited parsing/preprocessing support at first (e.g. just decoding images).

What do you think?

bileschi commented 5 years ago

Reading from a remote TFRecord is probably within the realm of possibility. For datasets of type A, where TensorFlow does not automatically provide a parser, it might be better to just represent the data as bytes and point the user to instructions, rather than packaging decoding software with the dataset download or within tfjs.datasets.

rsepassi commented 5 years ago

We can't do anything about datasets of type A (again, redistribution is a no-no). For those, the instructions can be to use python -m tensorflow_datasets.scripts.download_and_prepare to generate the TFRecord files somewhere, and then point tfjs.datasets at that directory. Not all datasets will work perfectly, because tfjs will not have all the decode ops, but you can probably cover a lot of datasets with just image decoding and some reshapes.

bileschi commented 5 years ago

That sounds like a good compromise.

chenqing commented 5 years ago

I am curious what the progress is now. We also need this in tfjs-node.