uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

can we use s3 path here instead of hdfs? #597

Open p9anand opened 4 years ago

p9anand commented 4 years ago

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# specify a cache dir first.
# the dir is used to save materialized spark dataframe files
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:/...')
# ### can we use an s3 path here instead of hdfs? ###

df_train, df_test = ...  # df_train and df_test are spark dataframes
model = Net()

# create a converter_train from df_train
# it will materialize df_train to the cache dir (the same for df_test)
converter_train = make_spark_converter(df_train)
converter_test = make_spark_converter(df_test)

# make a pytorch dataloader from converter_train
with converter_train.make_torch_dataloader() as dataloader_train:
    # dataloader_train is a torch.utils.data.DataLoader object
    # we can train the model using dataloader_train
    train(model, dataloader_train, ...)
    # when exiting the context, the reader of the dataset will be closed

# the same for converter_test
with converter_test.make_torch_dataloader() as dataloader_test:
    test(model, dataloader_test, ...)

# delete the cached files of the dataframes
converter_train.delete()
converter_test.delete()
```

selitvin commented 4 years ago

Did you try doing that? What did you observe?

p9anand commented 4 years ago

Yes, I tried the following code:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# specify a cache dir first.
# the dir is used to save materialized spark dataframe files
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 's3a://**/***/')

df_train, df_test = train_df, test_df  # df_train and df_test are spark dataframes
model = Net()

# create a converter_train from df_train
# it will materialize df_train to the cache dir (the same for df_test)
converter_train = make_spark_converter(df_train)
converter_test = make_spark_converter(df_test)

# make a pytorch dataloader from converter_train
with converter_train.make_torch_dataloader() as dataloader_train:
    # dataloader_train is a torch.utils.data.DataLoader object
    # we can train the model using dataloader_train
    train(model, dataloader_train, ...)
    # when exiting the context, the reader of the dataset will be closed

# the same for converter_test
with converter_test.make_torch_dataloader() as dataloader_test:
    test(model, dataloader_test, ...)

# delete the cached files of the dataframes
converter_train.delete()
converter_test.delete()
```

and got the following error:

*(screenshots of the error stack trace attached)*
selitvin commented 4 years ago

I see that you are using s3a:// protocol. We currently do not support it. Can you use s3:// in your case?
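For concreteness, a minimal sketch of the substitution being suggested; the bucket name and prefix below are hypothetical placeholders, not values from this thread:

```python
# same setup as above, but pointing the cache dir at an s3:// URL
# ('my-bucket' and 'petastorm-cache' are hypothetical placeholders)
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               's3://my-bucket/petastorm-cache')
```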

p9anand commented 4 years ago

Hi @selitvin ,

Thanks for the response. In my org we use the s3a protocol because it supports larger files. I can try using the s3 protocol. Is there a plan to support the s3a protocol?

p9anand commented 4 years ago

I get an error while using s3 buckets:

*(screenshots of the error stack trace attached)*
selitvin commented 4 years ago

I added s3n and s3a to the list of supported protocols and checked that I could both write and read from my s3 bucket.

Would you like to try it and see if it helps?

```
pip3 install git+https://github.com/selitvin/petastorm.git@support-s3-s3a-s3n
```
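A minimal sketch of what the check could look like with that branch installed, assuming `spark` is an active SparkSession and `df_train` is a Spark DataFrame; the bucket and prefix are hypothetical placeholders:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# 'my-bucket' and 'petastorm-cache' are hypothetical placeholders, not from this thread
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               's3a://my-bucket/petastorm-cache')

converter_train = make_spark_converter(df_train)   # materializes df_train under the s3a:// cache dir
with converter_train.make_torch_dataloader() as dataloader_train:
    batch = next(iter(dataloader_train))           # read one batch to confirm the round trip works
converter_train.delete()                           # remove the cached Parquet files
```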
p9anand commented 4 years ago

Sure, thanks!

dmcguire81 commented 4 years ago

@p9anand those distinctions are only meaningful on Hadoop-based systems (see "Technically what is the difference between s3n, s3a, and s3?"). Because petastorm delegates to s3fs, which in turn uses boto3 (indirectly), everything is treated with an object-native interface, regardless of the protocol. Since it makes no difference on this side, could you just substitute the protocol for one that is already accepted? That would be closer to a conceptually clear solution, because there isn't a meaningful way for petastorm to "support" the other protocols in the way you mean (i.e., other than to ignore the distinction and treat them all uniformly).

That said, your understanding of what s3a does in Hadoop already maps to how s3fs (and, by extension, petastorm) treats the s3 protocol: an object-native interface using AWS's own SDK (boto3 for Python, in this case), with no arbitrary limit on file size beyond S3's own limits. Nor do you need to worry about the s3 protocol not being interoperable (because it is a block filesystem): that Hadoop-specific view of the world is pretty dated. In the non-Hadoop world, s3 simply means the same thing as s3a (except on EMR, where it means EMRFS, but that's out of scope).
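A minimal sketch of the substitution being described, assuming the caller is handed an s3a:// URL from elsewhere in the org; the helper name and bucket are made up for illustration:

```python
def to_s3_scheme(url: str) -> str:
    """Rewrite Hadoop-style s3a:// or s3n:// URLs to the s3:// scheme petastorm accepts."""
    for prefix in ('s3a://', 's3n://'):
        if url.startswith(prefix):
            return 's3://' + url[len(prefix):]
    return url

# e.g. 's3a://my-bucket/petastorm-cache' (hypothetical) becomes 's3://my-bucket/petastorm-cache'
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               to_s3_scheme('s3a://my-bucket/petastorm-cache'))
```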

p9anand commented 4 years ago

Working perfectly! Thanks.