run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.44k stars 5k forks

[Bug]: Persisting to s3 storage is causing "botocore.exceptions.ClientError: An error occurred (InvalidPart) when calling the CompleteMultipartUpload operation: All non-trailing parts must have the same length." #7241

Closed scoobyd0027 closed 7 months ago

scoobyd0027 commented 1 year ago

Bug Description

This issue happens with only specific files. Following this tutorial https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/storage/save_load.html and using the same code, I am trying to upload the index of this file to S3: https://file.io/fWYseeDdhaTD.

s3fs - 2023.6.0

Version

v0.8.0

Steps to Reproduce

Here is the code; I am using Cloudflare R2, by the way. This file can be used to index and upload: https://file.io/fWYseeDdhaTD.

```python
import os
import s3fs
from llama_index.indices.vector_store.base import VectorStoreIndex
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=["biochemistry.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

AWS_KEY = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET = os.environ['AWS_SECRET_ACCESS_KEY']
R2_ACCOUNT_ID = os.environ['R2_ACCOUNT_ID']

assert AWS_KEY is not None and AWS_KEY != ""

s3 = s3fs.S3FileSystem(
    key=AWS_KEY,
    secret=AWS_SECRET,
    endpoint_url=f'https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com',
    s3_additional_kwargs={'ACL': 'public-read'},
)

index.set_index_id("vector_index")
index.storage_context.persist('llama-index/storage_demo', fs=s3)
```
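For completeness, a sketch of reading the persisted index back from the same bucket, per the save/load tutorial linked above (assumes the `s3` filesystem object and persist path from the snippet above; requires live credentials, so untested here):

```python
from llama_index import StorageContext, load_index_from_storage

# Rebuild the storage context from the remote persist dir, passing the same
# s3fs filesystem used to persist it.
storage_context = StorageContext.from_defaults(
    persist_dir='llama-index/storage_demo', fs=s3
)

# Load the index back by the id set before persisting.
index = load_index_from_storage(storage_context, index_id="vector_index")
```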

Relevant Logs/Tracebacks

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/aiobotocore/client.py", line 378, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidPart) when calling the CompleteMultipartUpload operation: All non-trailing parts must have the same length.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "persist_s3_issue.py", line 25, in <module>
    index.storage_context.persist('monic-chats/bio', fs=s3)
  File "/usr/local/lib/python3.8/site-packages/llama_index/storage/storage_context.py", line 118, in persist
    self.vector_store.persist(persist_path=vector_store_path, fs=fs)
  File "/usr/local/lib/python3.8/site-packages/llama_index/vector_stores/simple.py", line 191, in persist
    json.dump(self._data.to_dict(), f)
  File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1876, in close
    self.flush(force=True)
  File "/usr/local/lib/python3.8/site-packages/fsspec/spec.py", line 1747, in flush
    if self._upload_chunk(final=force) is not False:
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2250, in _upload_chunk
    self.commit()
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2278, in commit
    write_result = self._call_s3(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 2117, in _call_s3
    return self.fs.call_s3(method, self.s3_additional_kwargs, *kwarglist, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/usr/local/lib/python3.8/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
  File "/usr/local/lib/python3.8/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
OSError: [Errno 22] All non-trailing parts must have the same length.
dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Hi there,

Thanks for reaching out and providing detailed information about the issue you're facing. The error you're encountering is related to the AWS S3 multipart upload. When you upload a file in parts, all parts except the last part must be of the same size. The error message "All non-trailing parts must have the same length" indicates that this condition is not being met.
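To illustrate the rule (a standalone sketch, not s3fs internals): S3 accepts a multipart upload only if every part except the last has the same length.

```python
# Minimal illustration of the S3 multipart rule that the error refers to:
# all non-trailing parts must be the same size.

def split_into_parts(total_size: int, part_size: int) -> list:
    """Split total_size bytes into the part sizes a multipart upload would use."""
    parts = []
    remaining = total_size
    while remaining > 0:
        chunk = min(part_size, remaining)
        parts.append(chunk)
        remaining -= chunk
    return parts

def parts_are_valid(parts: list) -> bool:
    """Check the S3 constraint: every part except the trailing one is equal."""
    if len(parts) <= 1:
        return True
    head = parts[:-1]
    return all(p == head[0] for p in head)

# A 12 MiB file split into 5 MiB parts gives [5 MiB, 5 MiB, 2 MiB] -- valid,
# since only the trailing part differs. Unequal middle parts would be rejected.
```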

The persist method in the StorageContext class of the LlamaIndex codebase is designed to save the current state of the storage context to disk. If a file system instance is provided, it will be used to save the storage context. In your case, you're using s3fs as the file system instance to save the storage context to S3.

The issue might be related to the size of the file you're trying to upload or the way s3fs is handling the upload. It's possible that the file is not being split into equal parts for the upload, causing the error.

Unfortunately, I couldn't find a similar issue in the LlamaIndex repository. However, I would suggest trying to upload a smaller file to see if the issue persists. If the issue is indeed related to the file size, you might need to adjust the way the file is split for the upload.

Also, please ensure that you're using the latest versions of s3fs and botocore, as this might be a bug that has been fixed in a later version.

I hope this helps! If you have any other questions or if the issue persists, please let me know.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


scoobyd0027 commented 1 year ago

Hey team, I still need help. This happens randomly for different files, and I've upgraded to the latest boto3, botocore, s3fs, and llama-index.

mattmcgiv commented 1 year ago

Hi @scoobyd0027, I can help with this. So that I can troubleshoot the issue, would you provide any files that reliably cause the multipart upload exception? It seems the file.io links you previously provided have expired (it says the files have been deleted).

scoobyd0027 commented 11 months ago

Hey @mattmcgiv, thanks for reaching out, here is the file which is constantly failing https://drive.google.com/file/d/1lrGL6Aj_f6Tu_3GTDrBWvSw9e0BgXHQF/view?usp=sharing

mattmcgiv commented 11 months ago

Thanks @scoobyd0027 ... taking a look.

mattmcgiv commented 11 months ago

It seems that R2 is the issue here. R2 must not be fully S3-compatible: I can swap an Amazon S3 bucket in place of the R2 bucket in your sample code, and the upload works fine. The [Errno 22] All non-trailing parts must have the same length. error seems to come from the R2 API, yet based on the S3 documentation there is no such constraint on multipart uploads that I can find. I'd recommend switching your infrastructure from Cloudflare R2 to Amazon S3 to get your use case working, @scoobyd0027.

For the llama-index documentation [1], would we be in favor of making that change as well (using S3 instead of R2)? If so, I'm happy to take on the PR.

[1] https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/storage/save_load.html
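If switching providers isn't an option, one hedged workaround (a sketch, untested against R2) is to raise s3fs's write block size so the small persisted JSON files are buffered into a single part and the multipart path is skipped entirely. `default_block_size` is the s3fs constructor parameter for this; the credentials and endpoint below are placeholders:

```python
import s3fs

# Sketch: buffer writes up to 64 MiB before s3fs falls back to multipart,
# so files smaller than that are uploaded in one part. Placeholders only.
s3 = s3fs.S3FileSystem(
    key="YOUR_KEY",
    secret="YOUR_SECRET",
    endpoint_url="https://YOUR_ACCOUNT_ID.r2.cloudflarestorage.com",
    default_block_size=64 * 1024 * 1024,
)
```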

dosubot[bot] commented 7 months ago

Hi, @scoobyd0027,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you encountered an error related to persisting specific files to S3 storage using the s3fs library, and the community provided relevant code, version information, and traceback logs. After detailed discussions, it was identified that the issue was related to the R2 infrastructure not being fully S3-compatible. The recommendation was to switch to Amazon S3, with a proposal to update the llama-index documentation accordingly.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LlamaIndex project.
