uber / petastorm

Petastorm is a library that enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Allow users to use s3, s3a and s3n protocols when saving / reading datasets #601

Closed — selitvin closed this 4 years ago

selitvin commented 4 years ago

s3, s3a and s3n URL protocols can now be explicitly specified when saving petastorm datasets.

Fixed a bug on the petastorm dataset write execution path that previously prevented writing directly to s3 buckets.

Tested: modified examples/generate_external_dataset.py and examples/python_hello_world.py to write/read from an s3 bucket using s3a and s3n URLs (wasn't able to properly configure s3 authentication to check that). Was able to write/read data successfully.
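For context on why all three prefixes can point at the same data, here is an illustrative sketch (not petastorm's actual parsing code; `parse_dataset_url` is a hypothetical helper) showing that an s3, s3a, or s3n URL decomposes into the same bucket and key, differing only in scheme:

```python
from urllib.parse import urlparse

def parse_dataset_url(url):
    # Split a dataset URL into (scheme, bucket, key); the scheme is the
    # only part that differs between s3://, s3a:// and s3n:// URLs.
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path.lstrip("/")

# All three schemes address the same bucket and key:
for scheme in ("s3", "s3a", "s3n"):
    print(parse_dataset_url(f"{scheme}://my-bucket/datasets/hello_world"))
```

Running this prints the same bucket/key tuple three times, with only the scheme changing.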

codecov[bot] commented 4 years ago

Codecov Report

Merging #601 into master will not change coverage. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #601   +/-   ##
=======================================
  Coverage   85.68%   85.68%           
=======================================
  Files          87       87           
  Lines        4994     4994           
  Branches      795      795           
=======================================
  Hits         4279     4279           
  Misses        578      578           
  Partials      137      137           
| Impacted Files | Coverage Δ |
| --- | --- |
| petastorm/etl/dataset_metadata.py | 87.41% <100.00%> (ø) |
| petastorm/fs_utils.py | 91.75% <100.00%> (ø) |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update ffaf6b6...4e33f99.

selitvin commented 4 years ago

@dmcguire81 , can you please help me with this small review as you have a lot of experience with s3 and petastorm. Much appreciated!

dmcguire81 commented 4 years ago

@selitvin this will work, but I'm not sure you want to say that petastorm "supports" those protocols. Please see the comment in #597. When the author asks about s3 "instead of hdfs", they specifically think that s3 is some really, really old, Hadoop-specific notion of S3 storage (in 10 years in Big Data, I've not run across anyone using it), and it's not. It would be trivial, and conceptually less confusing, for them to substitute the protocol in their own URLs, as if to say "s3a in hadoop-aws is just s3 in boto3".

Now, if you want to do that for them, it's certainly safe to do so, because boto3's treatment of the s3 protocol should be strictly more powerful than s3n (no file size limitation), and equivalent to s3a. However, I'm afraid that introducing this meaningless distinction in Python may just lead more users coming from Hadoop/AWS to assume that s3 is the older, incompatible protocol.

dmcguire81 commented 4 years ago

If you still want to move forward after the Issue author responds, I have some s3a prefixes I can also test against, but s3n is going to be a challenge.

selitvin commented 4 years ago

Thank you for the comment! Not sure I understand. Does your comment about "petastorm supports s3/s3n/s3a" refer to the documentation, or the code?

According to your suggestion, would the user always specify `s3://...` and petastorm should replace it with `s3a://` when passing it down to `pyarrow.filesystem.S3FSWrapper` (because that's how boto3 is doing it)? Or did I misunderstand you?

dmcguire81 commented 4 years ago

@selitvin s3a and s3n are meaningless protocol specifiers in the Python stack. The only thing boto3 understands is s3. However, how it treats that protocol is exactly what the author has in mind for how Hadoop treats s3a (it's AWS native SDK code, and there is no arbitrary file limit). In short, asking the author to use s3 in place of s3a in all their URLs, because it's 100% compatible with the storage written, is reasonable, in my view.
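The substitution dmcguire81 describes, asking users to write s3 in place of s3a, amounts to a one-line scheme rewrite. A minimal sketch (a hypothetical `normalize_s3_scheme` helper, not petastorm code):

```python
def normalize_s3_scheme(url):
    # s3a:// and s3n:// are Hadoop-specific schemes with no meaning in the
    # Python stack; boto3 only understands s3://, so rewrite the prefix.
    for prefix in ("s3a://", "s3n://"):
        if url.startswith(prefix):
            return "s3://" + url[len(prefix):]
    return url

print(normalize_s3_scheme("s3a://my-bucket/dataset"))  # s3://my-bucket/dataset
print(normalize_s3_scheme("s3://my-bucket/dataset"))   # s3://my-bucket/dataset
```

This is safe in the direction described above because boto3's s3 handling is at least as capable as s3n (no file size limit) and equivalent to s3a.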

dmcguire81 commented 4 years ago

> Thank you for the comment! Not sure I understand. Does your comment about "petastorm supports s3/s3n/s3a" refer to the documentation, or the code?

Both, since there is implied support in the code.

> According to your suggestion, would the user always specify `s3://...` and petastorm should replace it with `s3a://` when passing it down to `pyarrow.filesystem.S3FSWrapper` (because that's how boto3 is doing it)? Or did I misunderstand you?

Petastorm should stay the same, in this scenario, except for maybe giving a hint that those protocols don't mean anything in Python.
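The "hint" suggested here could be as small as a warning emitted when a Hadoop-specific scheme is seen, while passing the URL through unchanged. A sketch under that assumption (the `hint_on_hadoop_scheme` helper is hypothetical, not petastorm's API):

```python
import warnings

def hint_on_hadoop_scheme(url):
    # Warn, but do not rewrite: the URL is handled exactly as if it
    # used the plain s3 scheme.
    scheme = url.split("://", 1)[0]
    if scheme in ("s3a", "s3n"):
        warnings.warn(
            f"'{scheme}' is a Hadoop-specific scheme with no special meaning "
            "in the Python stack; it is treated the same as 's3'."
        )
    return url
```

Keeping the URL unchanged avoids surprising users who expect the scheme they wrote to round-trip through logs and error messages.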

selitvin commented 4 years ago

Updated the release notes to the best of my understanding of the problem. Please let me know if it doesn't make sense and I'll fix it.

selitvin commented 4 years ago

> Looks good. What's your general release cadence?

On demand. I think we have enough goodies to cut a release. Unless you have more stuff coming up, I can do it by tomorrow EOD.

dmcguire81 commented 4 years ago

> Looks good. What's your general release cadence?
>
> On demand. I think we have enough goodies to cut a release. Unless you have more stuff coming up, I can do it by tomorrow EOD.

Nothing else pending (PyArrow changes look to be extensive). Tomorrow works.

selitvin commented 4 years ago

@dmcguire81: FYI, 0.9.6rc0 is available on PyPI.