Open · p9anand opened this issue 4 years ago
Did you try doing that? What did you observe?
Yes, I tried the following code:
```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 's3a://**/***/')

df_train, df_test = train_df, test_df  # `df_train` and `df_test` are spark dataframes
model = Net()

# create a converter_train from `df_train`
# it will materialize `df_train` to cache dir. (the same for df_test)
converter_train = make_spark_converter(df_train)
converter_test = make_spark_converter(df_test)

# make a pytorch dataloader from `converter_train`
with converter_train.make_torch_dataloader() as dataloader_train:
    # the `dataloader_train` is a `torch.utils.data.DataLoader` object
    # we can train model using the `dataloader_train`
    train(model, dataloader_train, ...)
    # when exiting the context, the reader of the dataset will be closed

# the same for `converter_test`
with converter_test.make_torch_dataloader() as dataloader_test:
    test(model, dataloader_test, ...)

converter_train.delete()
converter_test.delete()
```
and got the following error:
I see that you are using the `s3a://` protocol. We currently do not support it. Can you use `s3://` in your case?
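For example (just a sketch, with a hypothetical bucket/prefix since the real path is masked above), the only change on your side would be the scheme in the cache dir URL:

```python
from petastorm.spark import SparkDatasetConverter

# hypothetical bucket/prefix; only the scheme changes from s3a:// to s3://
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               's3://my-bucket/petastorm_cache/')
```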
Hi @selitvin,
Thanks for the response. In my org we use the s3a protocol as it supports larger files. I can try using the s3 protocol. Is there a plan to incorporate the s3a protocol?
error while using s3 buckets:
I added s3n and s3a to the list of supported protocols and checked that I could both write and read from my s3 bucket.
Would you like to try it and see if it helps?
```
pip3 install git+https://github.com/selitvin/petastorm.git@support-s3-s3a-s3n
```
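For instance (a sketch, with a hypothetical bucket/prefix), after installing that branch the original `s3a://` cache dir URL should be accepted as-is:

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# with the support-s3-s3a-s3n branch installed, an s3a:// (or s3n://) cache dir
# URL should be accepted just like s3://; hypothetical bucket/prefix below
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               's3a://my-bucket/petastorm_cache/')

converter_train = make_spark_converter(df_train)  # materializes df_train under the s3a cache dir
```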
Sure, thanks!
@p9anand those distinctions are only meaningful on Hadoop-based systems (see "Technically what is the difference between s3n, s3a, and s3?"). Because petastorm delegates to `s3fs`, which in turn uses `boto3` (indirectly), everything is treated with an object-native interface, regardless of the protocol. Since it makes no difference on this side, could you just substitute the protocols with something already accepted? That would be closer to a conceptually clear solution, because there isn't a meaningful way for petastorm to "support" the other protocols in the way you mean (i.e., other than to ignore the distinction and treat them all uniformly).

That said, your understanding of what `s3a` does in Hadoop already maps to how `s3fs` (and, by extension, petastorm) treats the `s3` protocol: an object-native interface using AWS's own SDK (`boto3` for Python, in this case), with no arbitrary limit on file size beyond S3's own limits. Moreover, you needn't worry about the `s3` protocol not being interoperable (because it's a Block FileSystem in Hadoop terms): this Hadoop-specific view of the world is pretty dated. In the non-Hadoop world, `s3` simply means the same thing as `s3a` (except on EMR, where it means `emrfs`, but that's out of scope).
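If rewriting the URL on the caller's side works for you, a tiny (hypothetical) helper like the one below makes the substitution explicit; since all three schemes end up on the same `s3fs`/`boto3` code path, it only changes the spelling of the scheme:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_s3_scheme(url):
    """Hypothetical helper: map Hadoop-style s3a:// / s3n:// URLs to plain s3://.

    petastorm (via s3fs/boto3) treats these identically, so only the scheme
    string changes; bucket, key and everything else are left untouched.
    """
    parts = urlsplit(url)
    if parts.scheme in ('s3a', 's3n'):
        parts = parts._replace(scheme='s3')
    return urlunsplit(parts)

# e.g. normalize_s3_scheme('s3a://my-bucket/cache/') == 's3://my-bucket/cache/'
```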
working perfectly...!! Thanks.
```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# specify a cache dir first.
# the dir is used to save materialized spark dataframe files
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:/...')
# can we use s3 path here instead of hdfs?

df_train, df_test = ...  # `df_train` and `df_test` are spark dataframes
model = Net()

# create a converter_train from `df_train`
# it will materialize `df_train` to cache dir. (the same for df_test)
converter_train = make_spark_converter(df_train)
converter_test = make_spark_converter(df_test)

# make a pytorch dataloader from `converter_train`
with converter_train.make_torch_dataloader() as dataloader_train:
    # the `dataloader_train` is a `torch.utils.data.DataLoader` object
    train(model, dataloader_train, ...)

# the same for `converter_test`
with converter_test.make_torch_dataloader() as dataloader_test:
    test(model, dataloader_test, ...)

# delete the cached files of the dataframes.
converter_train.delete()
converter_test.delete()
```
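For completeness, `train(...)` and `test(...)` in the snippets above are placeholders; a minimal (hypothetical) training loop over the petastorm dataloader might look roughly like this, assuming the dataframe has `features` and `label` columns:

```python
import torch

def train(model, dataloader, steps, lr=1e-3):
    # hypothetical training loop; petastorm's torch dataloader yields batches as
    # dicts keyed by the spark dataframe's column names ('features' / 'label'
    # below are assumptions about the schema). The loop is bounded by `steps`
    # because the converter's dataloader can cycle through the data indefinitely.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    batches = iter(dataloader)
    for _ in range(steps):
        batch = next(batches)
        x, y = batch['features'], batch['label']
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```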