ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Examples to use large remote dataset (s3 or minio) #2497

Closed: Jeffwan closed this issue 1 year ago

Jeffwan commented 2 years ago

Is your feature request related to a problem? Please describe.

I want to use a remote dataset hosted in S3 or MinIO. Do you have any examples? It seems most examples on the Ludwig website use built-in datasets or local files. Do you have guidance for using S3 or MinIO?

# Built-in dataset example from the docs
from ludwig.datasets import mushroom_edibility
dataset_df = mushroom_edibility.load()

# Local file example from the docs
import pandas as pd
dataset_df = pd.read_csv(dataset_path)
  1. I personally tried import dask.dataframe as dd; dataset_df = dd.read_csv('s3://bucket/myfiles.*.csv'), but I noticed I have to handle s3fs (required by Dask) myself. Is this the right way, or is there an easier way? (A rough sketch of what I tried follows the CLI commands below.)
  2. I also notice the dataset argument accepts a string. I am using MinIO for testing; does it support MinIO here? I want to customize the endpoint and signature version, e.g.:
aws configure set default.s3.signature_version s3v4
aws --endpoint-url http://minio-service:9090 s3 ls
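
Concretely, the full variant I tried looks roughly like this (access key, secret, and endpoint are placeholders; s3fs has to be installed for the s3:// path to resolve):

import dask.dataframe as dd

dataset_df = dd.read_csv(
    's3://bucket/myfiles.*.csv',
    storage_options={
        'key': 'accessKey',            # MinIO access key
        'secret': 'mySecretKey',       # MinIO secret key
        'client_kwargs': {'endpoint_url': 'http://minio-service:9090'},
        'config_kwargs': {'signature_version': 's3v4'},
    },
)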

Describe the use case

Use a remote dataset.

Describe the solution you'd like

Provide an easy-to-use wrapper.

tgaddair commented 2 years ago

Hey @Jeffwan, yes we support s3 / minio and any remote object storage supported by fsspec.

Reading the data from minio with Dask is one way to do it. This is the easiest way to go if your environment is not configured to automatically connect to the remote storage backend. We provide a wrapper ludwig.utils.data_utils.use_credentials that simplifies setting credentials:

from ludwig.utils.data_utils import use_credentials
import dask.dataframe as dd

creds = {
    'client_kwargs': {
        'aws_access_key_id': 'accessKey',
        'aws_secret_access_key': 'mySecretKey',
        'endpoint_url': 'http://minio-service:9000'
    }
}
with use_credentials(creds):
    df = dd.read_csv("s3://...")
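
Once loaded, the dataframe can be passed to Ludwig directly. A minimal sketch ("config.yaml" is just a placeholder):

from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")
results = model.train(dataset=df)  # df is the Dask dataframe read above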

The other option is to pass a string. This also works with MinIO, but it assumes that your environment is already set up to connect to s3 / minio without specifying any additional credentials. However, the endpoint_url makes this somewhat tricky with s3fs (see: https://github.com/fsspec/s3fs/issues/432). So for now I recommend providing the credentials explicitly and reading from Dask.
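
For completeness, the string-path version would look something like this, assuming s3fs is installed and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are already set in the environment ("config.yaml" and the bucket path are placeholders; a custom MinIO endpoint is the tricky part noted above):

from ludwig.api import LudwigModel

model = LudwigModel(config="config.yaml")
results = model.train(dataset="s3://my-bucket/train.csv")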

One thing we could do, if it would make things easier, is allow you to provide credentials (either a path to a credentials file or the values directly) within the Ludwig config, similar to how we let the user specify the cache credentials:

https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/backend/

Let me know if that would help simplify things.

One last thing to note: it is true that s3fs needs to be installed to connect to s3 / minio. We decided against including this and other libraries in the requirements to save space, but let me know if it would be preferred to bake them in the Docker image.

Jeffwan commented 2 years ago

Let me give ludwig.utils.data_utils.use_credentials a try. It seems to be equivalent to:

dataset_df = dd.read_csv(dataset_path, storage_options={"client_kwargs": dict(endpoint_url='http://minio-service:9000', aws_access_key_id='accessKey', aws_secret_access_key='mySecretKey')})

Regarding "We decided against including this and other libraries in the requirements to save space, but let me know if it would be preferred to bake them in the Docker image":

I asked this question because I was not sure whether using a Dask dataframe is the recommended pattern, since the image doesn't include it. Now it makes more sense.

Jeffwan commented 2 years ago

I have the following environment variables defined:

AWS_DEFAULT_REGION=us-east-1
AWS_SECRET_ACCESS_KEY=minio123
AWS_ACCESS_KEY_ID=minio

Option 1: ludwig.utils.data_utils.use_credentials -> failed

I tried it both with aws_access_key_id and without credentials; both show permission denied.

import dask.dataframe as dd
from ludwig.utils.data_utils import use_credentials

dataset_path = 's3://automl/hehe/rotten_tomatoes.csv'

# Attempt 1: explicit keys plus endpoint
minio_creds = {
    'client_kwargs': {
        'aws_access_key_id': 'minio',
        'aws_secret_access_key': 'minio123',
        'endpoint_url': 'http://10.227.151.166:30934'
    }
}

# Attempt 2: endpoint only, credentials taken from the environment
minio_creds = {
    'client_kwargs': {
        'endpoint_url': 'http://10.227.151.166:30934'
    }
}

with use_credentials(minio_creds):
    dataset_df = dd.read_csv(dataset_path)

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 529, in info
    out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 200, in _call_s3
    return method(**additional_kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/client.py", line 514, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/client.py", line 938, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 741, in read
    return read_pandas(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 545, in read_pandas
    b_out = read_bytes(
  File "/usr/local/lib/python3.8/site-packages/dask/bytes/core.py", line 109, in read_bytes
    size = fs.info(path)["size"]
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 548, in info
    raise ee
PermissionError: Forbidden

Option 2: storage_options -> success

>>> dataset_df = dd.read_csv(dataset_path, storage_options= {"client_kwargs": dict(endpoint_url='http://10.227.151.166:30934')})
>>> dataset_df.head()
   Unnamed: 0           movie_title content_rating  ... top_critic                                     review_content  recommended
0      283875  Deliver Us from Evil              R  ...       True  Director Scott Derrickson and his co-writer, P...            0
1      161408               Barbara          PG-13  ...      False  Somehow, in this stirring narrative, Barbara m...            1
2      423417       Horrible Bosses              R  ...      False  These bosses cannot justify either murder or l...            0
3      583216         Money Monster              R  ...      False  A satire about television that feels like it w...            0
4      165537         Battle Royale             NR  ...      False  Battle Royale is The Hunger Games not diluted ...            1

[5 rows x 8 columns]

Anyway, option 2 works for me now.

Jeffwan commented 2 years ago

There's a follow-up question. There seems to be an issue reported from DataFrame.sample(). Did I miss any configuration?

https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.sample.html#dask.dataframe.DataFrame.sample


>>> dataset_df = dd.read_csv(dataset_path, storage_options= {"client_kwargs": dict(endpoint_url='http://10.227.151.166:30934')})
>>>
>>> automl_config = create_auto_config(
...     dataset=dataset_df,
...     target='recommended',
...     time_limit_s=120,
...     tune_for_memory=False,
...     user_config=None,
...     random_seed=default_random_seed,
...     use_reference_config=False,
... )
Initializing new Ray cluster...
2022-09-15 06:40:57,539 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
2022-09-15 06:40:57,583 WARNING services.py:2002 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.37gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.

[Errno 2] No such file or directory: '/data/Deliver Us from Evil'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Barbara'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Horrible Bosses'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Money Monster'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Battle Royale'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Mystery, Alaska'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Wonder'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Woman Walks Ahead'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Blood Simple'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/O Som ao Redor (Neighbouring Sounds)'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/automl.py", line 152, in create_auto_config
    default_configs, features_metadata = _create_default_config(dataset, target, time_limit_s, random_seed)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 132, in _create_default_config
    dataset_info = get_dataset_info(dataset)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 189, in get_dataset_info
    return get_dataset_info_from_source(source)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 205, in get_dataset_info_from_source
    avg_words = source.get_avg_num_tokens(field)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/data_source.py", line 70, in get_avg_num_tokens
    return avg_num_tokens(self.df[column])
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/utils.py", line 75, in avg_num_tokens
    field = field.sample(n=5000, random_state=40)
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 1481, in sample
    raise ValueError(msg)
ValueError: sample does not support the number of sampled items parameter, 'n'. Please use the 'frac' parameter instead.


tgaddair commented 2 years ago

Hey @Jeffwan, I see the issue here. We typically take only the first 10k rows from the Dask DF for the type inference portion to speed things up, but it looks like we weren't doing that automatically in this code path. This should be fixed in #2508.
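
For context, the error comes from Dask's DataFrame.sample() only supporting frac (not n), so capping the rows up front avoids it. A rough sketch of that idea (not necessarily the exact change in #2508), reusing the path and column from your example:

import dask.dataframe as dd

df = dd.read_csv(
    dataset_path,
    storage_options={"client_kwargs": {"endpoint_url": "http://10.227.151.166:30934"}},
)
# head() pulls at most 10k rows into pandas, drawing from all partitions
sample_df = df.head(10_000, npartitions=-1)
avg_tokens = sample_df["review_content"].astype(str).str.split().str.len().mean()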

tgaddair commented 2 years ago

Regarding the issue with use_credentials, I gave you the format incorrectly. It should actually be as shown in this example: https://github.com/ludwig-ai/ludwig/blob/master/tests/ludwig/utils/test_data_utils.py#L118

So:

s3_creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": "http://localhost:9000",
            "aws_access_key_id": "test",
            "aws_secret_access_key": "test",
        }
    }
}

with use_credentials(s3_creds):
    ...  # read/write the remote dataset here

But if Option 2 works well for your use case, then that works too.

For reading from files given as string paths (so not needing to manually load from Dask), what would be your preferred way to provide the credentials?

I was thinking about adding something to the Ludwig config to specify credentials, like:

backend:
    credentials:
        s3:
            client_kwargs:
                endpoint_url: http://localhost:9000

For environment variables, we could provide a syntax similar to Skaffold:

backend:
    credentials:
        s3:
            client_kwargs:
                endpoint_url: {{.AWS_ENDPOINT_URL}}

Finally, we could also let the user provide a path:

backend:
    credentials:
        s3: /data/creds.json

Let me know if any of these would be useful or preferred over reading from Dask directly.

Jeffwan commented 2 years ago

We programmatically generate the config file, so I feel either way works for us. My program will receive a custom endpoint_url and override the value in the user config. All of the above options would work for us, and I don't have a preference at this moment.
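
For concreteness, this is roughly how we would inject it when generating the config, following the proposed backend.credentials layout above (not a released option; the endpoint default is a placeholder):

import os

user_config = {
    "input_features": [{"name": "review_content", "type": "text"}],
    "output_features": [{"name": "recommended", "type": "binary"}],
}
# Inject the object storage endpoint our platform hands to the program.
user_config["backend"] = {
    "credentials": {
        "s3": {
            "client_kwargs": {
                "endpoint_url": os.environ.get("AWS_ENDPOINT_URL", "http://minio-service:9000"),
            }
        }
    }
}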

Jeffwan commented 1 year ago

I can confirm the following approach works fine for my case. The only tricky thing is that I need to rely on the credential environment variables instead of client_kwargs, to avoid the issue described in the comments below.

s3_creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": object_storage_endpoint,
            # Do not pass the access key and secret key here: client_kwargs is passed
            # to boto3.client, so configuring them here raises
            # TypeError: create_client() got multiple values for keyword argument 'aws_access_key_id'.
            # Let the client read them from the environment instead.
            # "aws_access_key_id": os.environ['AWS_ACCESS_KEY_ID'],
            # "aws_secret_access_key": os.environ['AWS_SECRET_ACCESS_KEY'],
        }
    }
}

with use_credentials(s3_creds):
    ...  # my logic
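
For reference, the end-to-end shape of what works for me looks roughly like this (endpoint and bucket path are placeholders; the access key and secret come only from the environment variables shown earlier):

import os
import dask.dataframe as dd
from ludwig.utils.data_utils import use_credentials

object_storage_endpoint = os.environ.get("AWS_ENDPOINT_URL", "http://minio-service:9000")

s3_creds = {"s3": {"client_kwargs": {"endpoint_url": object_storage_endpoint}}}

# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY must already be set in the environment.
with use_credentials(s3_creds):
    dataset_df = dd.read_csv("s3://automl/hehe/rotten_tomatoes.csv")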

If you add the following config support in the future, it would save us the additional effort of configuring s3_creds. This is not a blocking issue, so I will close it now.

backend:
    credentials:
        s3:
            client_kwargs:
                endpoint_url: {{.AWS_ENDPOINT_URL}}