tensorflow/datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

need help to build a dataset from local numpy data #5330

Closed: CharlieLi2S closed this issue 5 months ago

CharlieLi2S commented 6 months ago

What I need help with / What I was wondering

I intend to build a dataset by modifying language_table_sim: embedding the natural-language instructions of the original dataset. Here's what I've done so far. I downloaded language_table_sim, loaded the data, and saved it as numpy files:

data
├── train
│   ├── episode_0
│   ├── episode_1
│   ├── …
├── val
│   ├── episode_100
│   ├── episode_101
│   ├── …
├── test
│   ├── episode_201
│   ├── episode_202
│   ├── …
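
Each episode file is assumed to hold a pickled list of per-step dicts (matching the fields used in the builder below); a quick sanity check of one file could look like this (the exact filename and keys are assumptions):

import numpy as np

# Inspect one saved episode file (filename is illustrative);
# each element is expected to be a dict with one step's data.
with open('data/train/episode_0.npy', 'rb') as f:
    steps = np.load(f, allow_pickle=True)
print(len(steps), sorted(steps[0].keys()))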

Then I wrote the builder script and placed it in the same directory as data. I tried to build the dataset with the command:

tfds build --data_dir \home\ds

but the following error occurs:

2024-03-21 04:09:22.381563: W tensorflow/tsl/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata.google.internal".

It seems that the builder is trying to connect to Google Cloud Storage, but I believe I don't need that because all the data are local. I've searched related issues such as https://github.com/tensorflow/datasets/issues/5194#issue-2043792034, but I couldn't fix it, so I would really appreciate any help.

Here's my builder script:

"""language_table_use_dataset_builder.py"""

from typing import Iterator, Tuple, Any

import glob

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub


class LanguageTableUse(tfds.core.GeneratorBasedBuilder):
    """DatasetBuilder for example dataset."""

    VERSION = tfds.core.Version('1.0.0')
    RELEASE_NOTES = {
        '1.0.0': 'Initial release.',
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._embed = hub.load("/home/universal_sentence_encoder")

    def _info(self) -> tfds.core.DatasetInfo:
        """Dataset metadata (homepage, citation,...)."""
        return self.dataset_info_from_configs(
            features=tfds.features.FeaturesDict({
                'steps': tfds.features.Dataset({
                    'observation': tfds.features.FeaturesDict({
                        'rgb': tfds.features.Image(
                            shape=(360, 640, 3),
                            dtype=np.uint8,
                            doc='RGB observation.',
                        ),
                        'effector_target_translation': tfds.features.Tensor(
                            shape=(2,),
                            dtype=np.float32,
                            doc='Robot effector target, i.e. x, y in the 2-D plane.',
                        ),
                        'effector_translation': tfds.features.Tensor(
                            shape=(2,),
                            dtype=np.float32,
                            doc='Robot effector state, i.e. x, y in the 2-D plane.',
                        ),
                        'instruction': tfds.features.Tensor(
                            shape=(512,),
                            dtype=np.float32,
                            doc='Universal sentence embedding of the instruction.',
                        ),
                    }),
                    'action': tfds.features.Tensor(
                        shape=(2,),
                        dtype=np.float32,
                        doc='Robot action.',
                    ),
                    'reward': tfds.features.Scalar(
                        dtype=np.float32,
                        doc='Reward if provided, 1 on final step for demos.'
                    ),
                    'is_first': tfds.features.Scalar(
                        dtype=np.bool_,
                        doc='True on first step of the episode.'
                    ),
                    'is_last': tfds.features.Scalar(
                        dtype=np.bool_,
                        doc='True on last step of the episode.'
                    ),
                    'is_terminal': tfds.features.Scalar(
                        dtype=np.bool_,
                        doc='True on last step of the episode if it is a terminal step, True for demos.'
                    ),
                }),
                'episode_metadata': tfds.features.FeaturesDict({
                    'file_path': tfds.features.Text(
                        doc='Path to the original data file.'
                    ),
                }),
            }))

    def _split_generators(self, dl_manager: tfds.download.DownloadManager):
        """Define data splits."""
        return {
            'train': self._generate_examples(path='data/train/episode_*.npy'),
            'val': self._generate_examples(path='data/val/episode_*.npy'),
            'test': self._generate_examples(path='data/test/episode_*.npy'),
        }

    def _generate_examples(self, path) -> Iterator[Tuple[str, Any]]:
        """Generator of examples for each split."""

        def _parse_example(episode_path):
            # load raw data --> this should change for your dataset
            with open(episode_path, 'rb') as file:
                data = np.load(file, allow_pickle=True)     # this is a list of dicts in our case

            def decode_inst(inst):
                return bytes(inst[np.where(inst != 0)].tolist()).decode("utf-8")

            # assemble episode --> here we're assuming demos so we set reward to 1 at the end
            episode = []
            for i, step in enumerate(data):
                # compute Kona language embedding
                language_embedding = self._embed([decode_inst(np.array(step['instruction']))])[0].numpy()

                episode.append({
                    'observation': {
                        'rgb': step['rgb'],
                        'effector_target_translation': step['effector_target_translation'],
                        'effector_translation': step['effector_translation'],
                        'instruction': language_embedding,
                    },
                    'action': step['action'],
                    'reward': step['reward'],
                    'is_first': step['is_first'],
                    'is_last': step['is_last'],
                    'is_terminal': step['is_terminal'],
                })

            # create output data sample
            sample = {
                'steps': episode,
                'episode_metadata': {
                    'file_path': episode_path
                }
            }

            # if you want to skip an example for whatever reason, simply return None
            return episode_path, sample

        # create list of all examples
        episode_paths = glob.glob(path)

        # for smallish datasets, use single-thread parsing
        for sample in episode_paths:
            yield _parse_example(sample)

        # for large datasets use beam to parallelize data parsing (this will have initialization overhead)
        # beam = tfds.core.lazy_imports.apache_beam
        # return (
        #         beam.Create(episode_paths)
        #         | beam.Map(_parse_example)
        # )
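
For reference, once the build succeeds the generated dataset should be loadable from the same data_dir; a minimal sketch (the output path /home/ds/language_table_use/1.0.0 is an assumption based on the class name and version):

import tensorflow_datasets as tfds

# Assumed layout: <data_dir>/<dataset_name>/<version>
builder = tfds.builder_from_directory('/home/ds/language_table_use/1.0.0')
ds = builder.as_dataset(split='train')
for episode in ds.take(1):
    print(episode['episode_metadata']['file_path'])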

Environment information: I was running it inside a Docker container; the image is tensorflow 2.14-gpu, and I've installed several packages myself.

rishusam commented 6 months ago

The error message you encountered suggests that TensorFlow is trying to authenticate with Google Cloud services to retrieve authentication tokens, but it's unable to do so because it's running in an environment where it can't access the necessary credentials.

Since you're working with a local dataset and don't need to interact with Google Cloud storage, you can disable the Google authentication by setting the environment variable GOOGLE_APPLICATION_CREDENTIALS to an empty string before running your script.

Here's how you can modify your script to disable Google authentication:

import os

# Disable Google Cloud authentication
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = ''

# Now import the required modules
import glob
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub

# Your existing code follows...

Adding this code at the beginning of your script will prevent TensorFlow from attempting to authenticate with Google Cloud services, and it should resolve the error you encountered.

Additionally, make sure that the file paths you're providing in your script ('/home/universal_sentence_encoder') are correct and accessible within your Docker container environment.
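
For example, a quick check from inside the container could look like this (just a sketch):

import os

# Verify the Universal Sentence Encoder export is visible inside the container
print(os.path.isdir('/home/universal_sentence_encoder'))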

Once you've made these modifications, try running your script again:

python language_table_use_dataset_builder.py

This should allow your script to build the dataset without encountering authentication errors. If you encounter any further issues, please let me know, and I'll be happy to assist you further.

CharlieLi2S commented 6 months ago

Thanks for your suggestion, but it seems that the issue still remains.

fylux commented 6 months ago

Hi,

Can you confirm that this message is related to your dataset preparation not working? Seems like it could be just a warning that shouldn't prevent the script from executing (see https://github.com/tensorflow/datasets/issues/2761)

You can also try using:

tfds.core.utils.gcs_utils._is_gcs_disabled = True
os.environ['NO_GCE_CHECK'] = 'true'

to avoid the error message.
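
For instance, these lines could go near the top of the builder script, before any data is processed (the surrounding imports are only for illustration):

import os

import tensorflow_datasets as tfds

# Skip GCS lookups and the GCE metadata check so the build stays fully local
tfds.core.utils.gcs_utils._is_gcs_disabled = True
os.environ['NO_GCE_CHECK'] = 'true'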

Let us know if it works.

CharlieLi2S commented 5 months ago

It works, thanks!