piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

Support for "append" mode for Azure Blobs #836

geovalexis opened this issue 5 days ago

geovalexis commented 5 days ago

Hi all!

I use smart-open for one of my projects and I've recently run into the need for "append" mode for Azure blobs. This is something Azure's SDK supports natively but it looks like it hasn't been implemented in smart-open yet.

I was thinking of adding support for this feature myself, but I was wondering if there are any additional concerns or inconveniences I might be missing.

P.S.: Thanks for such a simple yet useful tool!

Cheers.

ddelange commented 5 days ago

do you mean creating a new AppendBlob object on azure blob storage, or appending to an existing AppendBlob?

geovalexis commented 5 days ago

I mean appending to an existing AppendBlob.

ddelange commented 5 days ago

I guess if `'a' in mode`, we could make the blind assumption that we're talking about an AppendBlob, whether or not it already exists on remote.

geovalexis commented 5 days ago

What would happen if it's not an AppendBlob?

ddelange commented 5 days ago

raise a ValueError immediately on the open() call?
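A minimal sketch of that check, kept as pure logic so it's easy to test. The function name is hypothetical, not smart_open's actual API; in practice `existing_blob_type` would come from the Azure SDK's `BlobClient.get_blob_properties().blob_type`:

```python
def validate_append_mode(mode, existing_blob_type):
    """Raise ValueError if append mode targets a non-append blob.

    existing_blob_type is None when the blob does not exist yet
    (it would then be created as an AppendBlob), otherwise a string
    such as "AppendBlob" or "BlockBlob".
    """
    if "a" in mode and existing_blob_type not in (None, "AppendBlob"):
        raise ValueError(
            "append mode requires an AppendBlob, got %r" % existing_blob_type
        )
```

This way the `open()` call fails fast, instead of erroring on the first `append_block` call against an incompatible blob.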

ddelange commented 5 days ago

some concerns:

ddelange commented 4 days ago

I think bullet number 2 is a hard blocker. There's no way to revert an append block operation.

> Append Block uploads a block to the end of an existing append blob. The block of data is immediately available after the call succeeds on the server. A maximum of 50,000 appends are permitted for each append blob. Each block can be of different size.

ref https://learn.microsoft.com/en-us/rest/api/storageservices/append-block?tabs=microsoft-entra-id#remarks

The only workaround I can think of is to start uploading only in the close() call (i.e. a successful __exit__) using append_blob_from_stream. But I guess that's an anti-pattern, especially regarding memory usage for big (multi-part) streams and usage with generators and such.
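To make the trade-off concrete, here is a sketch of that buffer-until-close workaround. The class name is hypothetical; `upload_fn` stands in for an Azure SDK call such as `append_blob_from_stream`. Note the caveat discussed above: the whole payload sits in memory until close:

```python
import io


class BufferedAppendWriter:
    """Sketch: accumulate writes in memory, upload once on a clean close.

    A failed context exit uploads nothing, so no partial append is left
    behind -- at the cost of holding the entire payload in memory.
    """

    def __init__(self, upload_fn):
        self._buffer = io.BytesIO()
        self._upload_fn = upload_fn

    def write(self, data):
        return self._buffer.write(data)

    def close(self):
        # Single upload of everything written so far.
        self._upload_fn(self._buffer.getvalue())

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:  # only upload on a successful exit
            self.close()
        return False
```

For large multi-part streams or generator-fed writes, this buffering defeats the purpose of a streaming library, which is why it reads as an anti-pattern here.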

geovalexis commented 2 days ago

@ddelange thanks for sharing your thoughts on this.

Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Also, going over the code, I have noticed there is no append mode support for other cloud providers. Is it because they don't support it? Or because of similar blockers to this one?

ddelange commented 2 days ago

> @ddelange thanks for sharing your thoughts on this.
>
> Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Setting the chunksize default to 100MB for the new AppendWriter would at least allow aborting appends smaller than 100MB (I guess the bulk of applications for this feature), but in any case it's a big caveat that would get introduced with the feature.

> Also, going over the code, I have noticed there is no append mode support for other cloud providers. Is it because they don't support it? Or because of similar blockers to this one?

I'm not a maintainer (just an active contributor), but afaik it's because they only implement immutable objects.

geovalexis commented 22 hours ago

Sounds good @ddelange! I'll try to put something together and see if the maintainers like it.

ddelange commented 22 hours ago

Awesome :) Btw, 100MB is a hard chunk size limit on the Azure side, so we'd have to ensure that the bytes going into each append_block call never surpass this size. There's also a maximum number of blocks that can be appended to an AppendBlob, 50k iirc.

ddelange commented 21 hours ago

correction:

> Each block in an append blob can be a different size, up to a maximum of 4 MB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GB (4 MB X 50,000 blocks).

ref https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.appendblobservice.appendblobservice?view=azure-python-previous

smart_open's azure.py links to this table; maybe the low defaults we have now are a remnant from before 2019?
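The limits quoted above can be enforced with a simple chunking helper. This is a sketch under the documented Azure limits (4 MB per block, 50,000 blocks per append blob); the names are illustrative, not smart_open's API:

```python
MAX_BLOCK_SIZE = 4 * 1024 * 1024  # Azure limit: 4 MB per append_block call
MAX_BLOCK_COUNT = 50_000          # Azure limit: blocks per append blob

# Maximum append blob size implied by the limits (~195 GB)
MAX_APPEND_BLOB_SIZE = MAX_BLOCK_SIZE * MAX_BLOCK_COUNT


def iter_append_blocks(data, block_size=MAX_BLOCK_SIZE):
    """Yield chunks of data no larger than block_size bytes,
    each suitable for one append_block call."""
    if not 0 < block_size <= MAX_BLOCK_SIZE:
        raise ValueError("block_size must be in (0, %d]" % MAX_BLOCK_SIZE)
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]
```

A writer built on this would also need to track the running block count and fail (or warn) before exceeding `MAX_BLOCK_COUNT` on a single blob.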