Jeffwan closed this issue 1 year ago.
Hey @Jeffwan, yes, we support S3 / MinIO and any remote object storage supported by fsspec.

Reading the data from MinIO with Dask is one way to do it. This is the easiest option if your environment is not configured to connect to the remote storage backend automatically. We provide a wrapper, `ludwig.utils.data_utils.use_credentials`, that simplifies setting credentials:
```python
import dask.dataframe as dd

from ludwig.utils.data_utils import use_credentials

creds = {
    'client_kwargs': {
        'aws_access_key_id': 'accessKey',
        'aws_secret_access_key': 'mySecretKey',
        'endpoint_url': 'http://minio-service:9000'
    }
}

with use_credentials(creds):
    df = dd.read_csv("s3://...")
```
The other option is to pass a string path. This also works with MinIO, but it assumes your environment is already set up to connect to S3 / MinIO without specifying any additional credentials. However, the `endpoint_url` makes this somewhat tricky with s3fs (see: https://github.com/fsspec/s3fs/issues/432), so for now I recommend providing the credentials explicitly and reading with Dask.
One thing we could do, if it would make things easier, is allow you to provide credentials (either a path to a credentials file or the credentials directly) within the Ludwig config, similar to how we let the user specify cache credentials:
https://ludwig-ai.github.io/ludwig-docs/0.5/configuration/backend/
Let me know if that would help simplify things.
One last thing to note: it is true that s3fs needs to be installed to connect to S3 / MinIO. We decided against including this and other libraries in the requirements to save space, but let me know if it would be preferred to bake them into the Docker image.
Let me give `ludwig.utils.data_utils.use_credentials` a try. Seems it's equivalent to:

```python
dataset_df = dd.read_csv(
    dataset_path,
    storage_options={
        "client_kwargs": {
            "endpoint_url": "http://minio-service:9000",
            "aws_access_key_id": "accessKey",
            "aws_secret_access_key": "mySecretKey",
        }
    },
)
```
> We decided against including this and other libraries in the requirements to save space, but let me know if it would be preferred to bake them in the Docker image.
I asked this question because I was not sure whether using a Dask dataframe is the recommended pattern, since the image doesn't include it. Now it makes more sense.
I have the following envs defined:

```shell
AWS_DEFAULT_REGION=us-east-1
AWS_SECRET_ACCESS_KEY=minio123
AWS_ACCESS_KEY_ID=minio
```
I tried passing `aws_access_key_id` explicitly and also going without credentials; both show permission denied.
```python
dataset_path = 's3://automl/hehe/rotten_tomatoes.csv'

minio_creds = {
    'client_kwargs': {
        'aws_access_key_id': 'minio',
        'aws_secret_access_key': 'minio123',
        'endpoint_url': 'http://10.227.151.166:30934'
    }
}

# also tried passing only the endpoint, letting the keys come from the environment
minio_creds = {
    'client_kwargs': {
        'endpoint_url': 'http://10.227.151.166:30934'
    }
}

with use_credentials(minio_creds):
    dataset_df = dd.read_csv(dataset_path)
```
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 529, in info
    out = self._call_s3(self.s3.head_object, kwargs, Bucket=bucket,
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 200, in _call_s3
    return method(**additional_kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/client.py", line 514, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.8/site-packages/botocore/client.py", line 938, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 741, in read
    return read_pandas(
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/io/csv.py", line 545, in read_pandas
    b_out = read_bytes(
  File "/usr/local/lib/python3.8/site-packages/dask/bytes/core.py", line 109, in read_bytes
    size = fs.info(path)["size"]
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 548, in info
    raise ee
PermissionError: Forbidden
```
```python
>>> dataset_df = dd.read_csv(dataset_path, storage_options={"client_kwargs": dict(endpoint_url='http://10.227.151.166:30934')})
>>> dataset_df.head()
   Unnamed: 0           movie_title content_rating  ... top_critic                                     review_content recommended
0      283875  Deliver Us from Evil              R  ...       True  Director Scott Derrickson and his co-writer, P...           0
1      161408               Barbara          PG-13  ...      False  Somehow, in this stirring narrative, Barbara m...           1
2      423417       Horrible Bosses              R  ...      False  These bosses cannot justify either murder or l...           0
3      583216         Money Monster              R  ...      False  A satire about television that feels like it w...           0
4      165537         Battle Royale             NR  ...      False  Battle Royale is The Hunger Games not diluted ...           1

[5 rows x 8 columns]
```
Anyway, option 2 works for me now.

There's a follow-up question. It seems there's an issue reported from `dataframe.sample()`. Did I miss any configuration?
```python
>>> dataset_df = dd.read_csv(dataset_path, storage_options={"client_kwargs": dict(endpoint_url='http://10.227.151.166:30934')})
>>>
>>> automl_config = create_auto_config(
...     dataset=dataset_df,
...     target='recommended',
...     time_limit_s=120,
...     tune_for_memory=False,
...     user_config=None,
...     random_seed=default_random_seed,
...     use_reference_config=False,
... )
```
```
Initializing new Ray cluster...
2022-09-15 06:40:57,539 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
2022-09-15 06:40:57,583 WARNING services.py:2002 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.37gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[Errno 2] No such file or directory: '/data/Deliver Us from Evil'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Barbara'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Horrible Bosses'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Money Monster'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Battle Royale'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Mystery, Alaska'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Wonder'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Woman Walks Ahead'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/Blood Simple'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
[Errno 2] No such file or directory: '/data/O Som ao Redor (Neighbouring Sounds)'
While assessing potential image in is_image() for column movie_title, encountered exception: 'NoneType' object has no attribute 'tell'
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/automl.py", line 152, in create_auto_config
    default_configs, features_metadata = _create_default_config(dataset, target, time_limit_s, random_seed)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 132, in _create_default_config
    dataset_info = get_dataset_info(dataset)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 189, in get_dataset_info
    return get_dataset_info_from_source(source)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/base_config.py", line 205, in get_dataset_info_from_source
    avg_words = source.get_avg_num_tokens(field)
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/data_source.py", line 70, in get_avg_num_tokens
    return avg_num_tokens(self.df[column])
  File "/usr/local/lib/python3.8/site-packages/ludwig/automl/utils.py", line 75, in avg_num_tokens
    field = field.sample(n=5000, random_state=40)
  File "/usr/local/lib/python3.8/site-packages/dask/dataframe/core.py", line 1481, in sample
    raise ValueError(msg)
ValueError: sample does not support the number of sampled items parameter, 'n'. Please use the 'frac' parameter instead.
```
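For anyone hitting this before a fix lands: Dask's `Series.sample` only accepts `frac`, so sampling roughly `n` rows has to be expressed as a fraction of the (computed) total length. A minimal sketch of that conversion, where `frac_for_n` is a hypothetical helper and not Ludwig code:

```python
def frac_for_n(n: int, total: int) -> float:
    """Return the fraction that yields approximately n sampled items.

    Capped at 1.0 so we never ask for more rows than exist.
    """
    if total <= 0:
        return 0.0
    return min(n / total, 1.0)

# With a Dask series this would look like (not executed here):
#   total = len(series)  # triggers a count computation
#   sampled = series.sample(frac=frac_for_n(5000, total), random_state=40)
print(frac_for_n(5000, 100_000))  # 0.05
```

Note that `frac`-based sampling returns approximately, not exactly, `n` rows, since Dask samples each partition independently.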
Hey @Jeffwan, I see the issue here. We typically take only the first 10k rows from the Dask DF for the type inference portion to speed things up, but it looks like we weren't doing that automatically in this code path. Should be fixed in #2508.
Regarding the issue with `use_credentials`, I gave you the format incorrectly. It should actually be as shown in this example: https://github.com/ludwig-ai/ludwig/blob/master/tests/ludwig/utils/test_data_utils.py#L118

So:
```python
s3_creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": "http://localhost:9000",
            "aws_access_key_id": "test",
            "aws_secret_access_key": "test",
        }
    }
}

with use_credentials(s3_creds):
    ...
```
But if Option 2 works well for your use case, then that works too.
For reading from files given as string paths (so you don't need to manually load with Dask), what would be your preferred way to provide the credentials?
I was thinking about adding something to the Ludwig config to specify credentials, like:

```yaml
backend:
  credentials:
    s3:
      client_kwargs:
        endpoint_url: http://localhost:9000
```
For environment variables, we could provide a syntax similar to Skaffold:

```yaml
backend:
  credentials:
    s3:
      client_kwargs:
        endpoint_url: {{.AWS_ENDPOINT_URL}}
```
Finally, we could also let the user provide a path:

```yaml
backend:
  credentials:
    s3: /data/creds.json
```
Let me know if any of these would be useful or preferred over reading from Dask directly.
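If it helps evaluate the env-var option: the Skaffold-style `{{.VAR}}` substitution could be done with a small regex pass over config values. A rough sketch, where `expand_env` is a hypothetical helper and not existing Ludwig code:

```python
import os
import re

# Matches placeholders like {{.AWS_ENDPOINT_URL}}
_ENV_PATTERN = re.compile(r"\{\{\.(\w+)\}\}")

def expand_env(value: str) -> str:
    """Replace {{.VAR}} placeholders with os.environ['VAR'].

    Unknown variables are left untouched so the error surfaces downstream.
    """
    return _ENV_PATTERN.sub(
        lambda m: os.environ.get(m.group(1), m.group(0)), value
    )

os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"
print(expand_env("{{.AWS_ENDPOINT_URL}}"))  # http://localhost:9000
```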
We generate the config file programmatically, so I feel either way works for us. My program will receive a custom `endpoint_url` and override it in the user config. All of the above options work for us, and I don't have a preference at the moment.
I can confirm the following works fine for my case. The only tricky thing is that I need to use the credential environment variables instead of `client_kwargs` to work around the issue below.
```python
s3_creds = {
    "s3": {
        "client_kwargs": {
            "endpoint_url": object_storage_endpoint,
            # Do not pass the access key and secret key here. client_kwargs is
            # forwarded to boto3.client, so configuring them in both places raises:
            # TypeError: create_client() got multiple values for keyword argument 'aws_access_key_id'
            # Let the client read them from the environment instead.
            # "aws_access_key_id": os.environ['AWS_ACCESS_KEY_ID'],
            # "aws_secret_access_key": os.environ['AWS_SECRET_ACCESS_KEY'],
        }
    }
}

with use_credentials(s3_creds):
    xxxx  # my logic
```
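An alternative that may avoid the keyword collision entirely, if I read the s3fs docs correctly: s3fs accepts the access key and secret as top-level `key` / `secret` arguments, separate from `client_kwargs` (which is what gets forwarded to `boto3.client`). A sketch of building `storage_options` that way, where `minio_storage_options` is a hypothetical helper:

```python
import os

def minio_storage_options(endpoint_url: str) -> dict:
    """Build Dask/s3fs storage_options with credentials kept out of client_kwargs."""
    return {
        # s3fs top-level credentials; not forwarded to boto3.client directly
        "key": os.environ.get("AWS_ACCESS_KEY_ID"),
        "secret": os.environ.get("AWS_SECRET_ACCESS_KEY"),
        # client_kwargs is passed through to boto3.client, so it only
        # carries the custom endpoint here
        "client_kwargs": {"endpoint_url": endpoint_url},
    }

# Usage (not executed here):
#   dd.read_csv(dataset_path,
#               storage_options=minio_storage_options("http://minio-service:9000"))
```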
If the following config support lands in the future, it would save us the extra effort of configuring `s3_creds`. This is not a blocking issue, so I will close it now.
```yaml
backend:
  credentials:
    s3:
      client_kwargs:
        endpoint_url: {{.AWS_ENDPOINT_URL}}
```
**Is your feature request related to a problem? Please describe.**
I want to use a remote dataset hosted in S3 or MinIO. Do you have any examples? It seems most examples on the Ludwig website use built-in datasets or local files. Do you have guidance on using S3 or MinIO? I can do:

```python
import dask.dataframe as dd

dataset_df = dd.read_csv('s3://bucket/myfiles.*.csv')
```

but I notice I have to handle `s3fs` (required by Dask) and the `endpoint` and `signature` configuration myself. Is this the right way, or is there an easier way?

**Describe the use case**
Use a remote dataset.

**Describe the solution you'd like**
Provide an easy-to-use wrapper.