Closed — selitvin closed this 4 years ago
Merging #601 into master will not change coverage. The diff coverage is 100.00%.
```
@@           Coverage Diff           @@
##           master     #601   +/-   ##
=======================================
  Coverage   85.68%   85.68%
=======================================
  Files          87       87
  Lines        4994     4994
  Branches      795      795
=======================================
  Hits         4279     4279
  Misses        578      578
  Partials      137      137
```

Impacted Files | Coverage Δ | |
---|---|---|
petastorm/etl/dataset_metadata.py | 87.41% <100.00%> (ø) | |
petastorm/fs_utils.py | 91.75% <100.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update ffaf6b6...4e33f99.
@dmcguire81, can you please help me with this small review, as you have a lot of experience with s3 and petastorm? Much appreciated!
@selitvin this will work, but I'm not sure you want to say that petastorm "supports" those protocols. Please see the comment in #597. When the author asks "instead of hdfs", they specifically think that s3 is some really, really old, Hadoop-specific notion of S3 storage (in 10 years in Big Data, I've not run across anyone using it), and it's not. It would be trivial, and conceptually less confusing, for them to substitute the protocol in their own URLs, as if to say "`s3a` in `hadoop-aws` is just `s3` in `boto3`".

Now, if you want to do that for them, it's certainly safe to do so, because `boto3`'s treatment of the `s3` protocol should be strictly more powerful than `s3n` (no file size limitation), and equivalent to `s3a`. However, I'm afraid that introducing this meaningless distinction in Python may just make more users coming from Hadoop/AWS assume that `s3` is the older, incompatible protocol.

If you still want to move forward after the issue author responds, I have some `s3a` prefixes I can also test against, but `s3n` is going to be a challenge.
Thank you for the comment! Not sure I understand: does your comment about "petastorm supports s3/s3n/s3a" refer to the documentation, or to the code?

Following your suggestion, would the user always specify `s3://...` and petastorm should replace it with `s3a://` when passing it down to `pyarrow.filesystem.S3FSWrapper` (because that's how boto3 does it?), or did I misunderstand you?
@selitvin `s3a` and `s3n` are meaningless protocol specifiers in the Python stack. The only thing `boto3` understands is `s3`. However, how it treats that protocol is exactly what the author has in mind for how Hadoop treats `s3a` (it's AWS-native SDK code, and there is no arbitrary file size limit). In short, asking the author to use `s3` in place of `s3a` in all their URLs, because it's 100% compatible with the storage written, is reasonable, in my view.
> Thank you for the comment! Not sure I understand. Is your comment about "petastorm supports s3/s3n/s3a" refers to the documentation, or code?

Both, since there is implied support in the code.

> According to your suggestion would user always specify `s3://...` and petastorm should replace it with `s3a://` when passing it down to `pyarrow.filesystem.S3FSWrapper` (because that's how boto3 is doing it?), or I misunderstood you?

Petastorm should stay the same, in this scenario, except for maybe giving a hint that those protocols don't mean anything in Python.
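One way such a hint could look (a sketch only — the function name and warning text are assumptions, not petastorm's actual behavior) is to emit a warning whenever a Hadoop-specific scheme is seen:

```python
import warnings
from urllib.parse import urlsplit

def hint_s3_scheme(url):
    """Hypothetical sketch: warn callers that s3a/s3n are Hadoop-side
    distinctions with no meaning in the Python/boto3 stack."""
    scheme = urlsplit(url).scheme
    if scheme in ("s3a", "s3n"):
        warnings.warn(
            "The '%s' scheme is Hadoop-specific; in Python it is "
            "treated identically to 's3'." % scheme
        )
```

The URL itself would still be handled unchanged; the warning only educates users coming from the Hadoop side.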
Updated release notes to the best of my understanding of the problem. Please let me know if it does not make sense and I'll fix it.
Looks good. What's your general release cadence?
On demand. I think we have enough goodies to cut a release. Unless you have more stuff coming up, I can do it by tomorrow EOD.
> Looks good. What's your general release cadence?
>
> On demand. I think we have enough goodies to cut a release. Unless you have more stuff coming up, I can do it by tomorrow EOD.
Nothing else pending (PyArrow changes look to be extensive). Tomorrow works.
@dmcguire81 : FYI: 0.9.6rc0 is available on pypi.
- `s3`, `s3a` and `s3n` URL protocols can be explicitly specified when saving petastorm datasets.
- Fixed a bug on the petastorm dataset write execution path that previously prevented writing directly to s3 buckets.
Tested: modified `examples/generate_external_dataset.py` and `examples/python_hello_world.py` to write/read from an s3 bucket using `s3a` and `s3n` URLs (wasn't able to properly configure s3 authentication to check that). Was able to write/read data successfully.