Open geovalexis opened 2 months ago
do you mean creating a new AppendBlob object on azure blob storage, or appending to an existing AppendBlob?
I mean appending to an existing AppendBlob.
I guess if `'a' in mode` we could make the blind assumption that we're talking about an AppendBlob, whether it already exists on the remote or not.
what would happen if it's not an AppendBlob?
raise a ValueError immediately on the open() call?
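To make that idea concrete, here is a minimal sketch of the check. The function name and parameters are hypothetical; in practice the `blob_type` value would come from something like `BlobClient.get_blob_properties().blob_type` in azure-storage-blob, but the validation logic itself is plain Python:

```python
def validate_append_target(blob_type, exists):
    """Hypothetical helper: fail fast at open() time if the existing
    remote blob cannot be appended to.

    blob_type -- the remote blob's type as a string, e.g. "AppendBlob"
    exists    -- whether the blob already exists on the remote
    """
    if exists and blob_type != "AppendBlob":
        raise ValueError(
            "cannot open in append mode: existing blob has type %r, "
            "expected 'AppendBlob'" % blob_type
        )
    # if the blob doesn't exist yet, we'd create a fresh AppendBlob instead
```

A non-existing blob passes the check, matching the "whether it already exists on remote or not" assumption above.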
some concerns:

1. `append_block` would have to go into `_upload_part`, which would probably warrant a new `AppendWriter` (sub)class conforming to this mechanic.
2. the `terminate()` method needs to be amended such that the whole append operation gets aborted upon a `terminate()` call (i.e. when the with-statement is aborted by an exception). That means that `append_block` might not be usable, as it probably already commits upon each call.

I think bullet number 2 is a hard blocker. There's no way to revert an append block operation.
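To illustrate why bullet 2 is a blocker, here is an illustrative-only skeleton. The `AppendWriter` name and `_upload_part`/`terminate` methods follow the discussion above; the `client` object stands in for azure-storage-blob's `BlobClient`, whose `append_block` call commits server-side as soon as it succeeds:

```python
class AppendWriter:
    """Sketch of a hypothetical AppendWriter -- not smart_open's actual API."""

    def __init__(self, client):
        self._client = client  # expected to expose append_block(data)
        self._bytes_committed = 0

    def _upload_part(self, data):
        # each call commits immediately on the server and is irreversible
        self._client.append_block(data)
        self._bytes_committed += len(data)

    def terminate(self):
        # nothing can be aborted: blocks already appended are visible to
        # readers -- this is exactly the hard blocker described above
        pass
```

With a fake client the commit-per-call behavior is easy to see: after two `_upload_part` calls, both blocks are already "on the server" and `terminate()` cannot take them back.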
> **Append Block** uploads a block to the end of an existing append blob. The block of data is immediately available after the call succeeds on the server. A maximum of 50,000 appends are permitted for each append blob. Each block can be of different size.
The only workaround I can think of is to only start uploading in the `close()` call (i.e. on a successful `__exit__`) using `append_blob_from_stream`. But I guess that's an anti-pattern, especially regarding memory usage for big (multi-part) streams and usage with generators and such.
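A minimal sketch of that workaround, to show both the mechanics and the memory caveat. The class name is hypothetical, and `upload` stands in for a single-shot call like `append_blob_from_stream`; everything written before `close()` has to sit in memory:

```python
import io


class BufferedAppendWriter:
    """Sketch of the close()-time upload workaround (illustrative only).

    All writes accumulate in an in-memory buffer; only a successful
    with-block triggers the single upload call, so an exception inside
    the block aborts the whole append -- at the cost of buffering
    the entire stream in memory.
    """

    def __init__(self, upload):
        self._upload = upload  # called once with the complete payload
        self._buffer = io.BytesIO()

    def write(self, data):
        self._buffer.write(data)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self._upload(self._buffer.getvalue())
        return False  # never swallow the exception
```

This gives the abort-on-exception semantics `terminate()` wants, but as noted above it breaks down for big multi-part streams and generator-fed writes.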
@ddelange thanks for sharing your thoughts on this.
Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.
Also, going over the code, I have noticed there is no append mode support for the other cloud providers. Is it because they don't support it? Or because of similar blockers to this one?
> @ddelange thanks for sharing your thoughts on this.
> Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.
Setting the chunksize default to 100MB for the new `AppendWriter` would allow for aborting at least appends smaller than 100MB (I guess the bulk of applications for this feature), but in any case it's a big caveat that would get introduced with the feature.
> Also, going over the code, I have noticed there is no append mode support for the other cloud providers. Is it because they don't support it? Or because of similar blockers to this one?
I'm not a maintainer (just an active contributor), but afaik it's because they only implement immutable objects.
Sounds good @ddelange! I'll try to put something together and see if the maintainers like it.
Awesome :) The 100MB is a hard chunk size limit on the Azure side btw; we'd have to ensure that the bytes going into `append_block` never surpass this size. There's also a maximum number of blocks that can be appended to an AppendBlob, 50k iirc.
correction:

> Each block in an append blob can be a different size, up to a maximum of 4 MB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GB (4 MB X 50,000 blocks).
smart_open's `azure.py` links to this table; maybe the low defaults we have now are a remnant from before 2019?
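The quoted limits are easy to encode defensively. Below is a minimal sketch, assuming the 4 MB-per-block / 50,000-block limits above; `iter_append_blocks` is a hypothetical helper name, not part of smart_open:

```python
MAX_BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB hard limit per append block
MAX_BLOCK_COUNT = 50_000           # hard limit of blocks per append blob


def iter_append_blocks(data):
    """Split a bytes payload into chunks that each fit in one append block."""
    for offset in range(0, len(data), MAX_BLOCK_SIZE):
        yield data[offset:offset + MAX_BLOCK_SIZE]
```

With these constants, the ~195 GB ceiling quoted above falls out directly: 50,000 blocks x 4 MB = 200,000 MB.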
Hi all!
I use smart-open for one of my projects and I've recently run into the need for "append" mode for Azure blobs. This is something Azure's SDK supports natively but it looks like it hasn't been implemented in smart-open yet.
I was thinking of adding support for this feature myself, but I was wondering if there is any additional concern/inconvenience I might be missing.
P.S.: Thanks for such a simple yet useful tool!
Cheers.