tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

Add script to export tensorflow_data_validation #2299

Open Conchylicultor opened 4 years ago

Conchylicultor commented 4 years ago

Currently, there is an option to generate TFDV statistics for FACET during download_and_prepare. However, it would be best to separate those two steps.

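import tensorflow_datasets as tfds

builder = tfds.builder('dataset')  # 'dataset' is a placeholder name
builder.download_and_prepare()

# Proposed function (does not exist yet): compute statistics as a separate step.
generate_statistics(builder, dst=builder.data_dir)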

Then we should add a new script, scripts/generate_statistics.py, which generates the statistics for one dataset and exports them to gs://. We would run it automatically to generate statistics for all datasets. When a user pre-loads the DatasetInfo, it would also fetch statistics.json from GCS, so users could visualize FACET for all default TFDS datasets with the existing tfds.show_statistics.
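
To make the read side of this concrete, here is a minimal sketch of how a pre-loaded dataset directory could be checked for exported statistics. The helper name `load_statistics`, the constant `STATISTICS_FILENAME`, and the assumption that the file is plain JSON in a local directory are all hypothetical; nothing like this exists in TFDS today, and a real version would need a GCS-aware filesystem layer rather than `pathlib`.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional

# Hypothetical file name, taken from the proposal above.
STATISTICS_FILENAME = "statistics.json"


def load_statistics(data_dir) -> Optional[Dict[str, Any]]:
    """Return the parsed statistics stored in `data_dir`, or None if absent.

    Assumption: `data_dir` is a local directory. Datasets without exported
    statistics are not an error, so a missing file yields None.
    """
    path = Path(data_dir) / STATISTICS_FILENAME
    if not path.is_file():
        return None
    with path.open() as f:
        return json.load(f)
```

Returning None instead of raising keeps datasets without statistics usable, which seems like the right default if the statistics are only an optional extra.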

devil-cyber commented 3 years ago

Hi, is this issue still open? Can I work on it?

Conchylicultor commented 3 years ago

I think this could still be worked on. As a first step, I think we could add a standalone script/function generate_statistics(builder, dst_dir) which generates the statistics and saves them in dst_dir / statistics.json.

This way we don't add any more complexity to the core API. Later, we can think more about how to better expose the statistics directly inside the core API.
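
As a rough illustration of that first step, the standalone function could look something like the sketch below. It only relies on `builder.info.splits` and `builder.as_dataset(split=...)`. The statistics themselves are a placeholder (example counts and per-feature presence counts) supplied through a hypothetical `compute_fn` hook, which is where a real TFDV call would go. Writing with `pathlib` assumes a local `dst_dir`; exporting to GCS would need a different file layer.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable


def _count_examples(examples: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Placeholder statistics, NOT TFDV output: counts examples and features."""
    num_examples = 0
    feature_counts: Dict[str, int] = {}
    for example in examples:
        num_examples += 1
        for name in example:  # top-level feature names only
            feature_counts[name] = feature_counts.get(name, 0) + 1
    return {"num_examples": num_examples, "feature_counts": feature_counts}


def generate_statistics(
    builder,
    dst_dir,
    compute_fn: Callable[[Iterable[Dict[str, Any]]], Dict[str, Any]] = _count_examples,
) -> Path:
    """Compute per-split statistics and save them to dst_dir / statistics.json.

    Assumes `builder.download_and_prepare()` has already been called.
    `compute_fn` is a hypothetical hook for the real statistics backend.
    """
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    stats = {}
    for split_name in builder.info.splits:
        examples = builder.as_dataset(split=split_name)
        stats[split_name] = compute_fn(examples)

    path = dst_dir / "statistics.json"
    with path.open("w") as f:
        json.dump(stats, f, indent=2)
    return path
```

Because the function only takes a builder and a destination, it stays outside the core API and can be called from a script after `download_and_prepare()`.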