Closed scoobyd0027 closed 7 months ago
Answer generated by a 🤖
Hi there,
Thanks for reaching out and providing detailed information about the issue you're facing. The error you're encountering is related to the AWS S3 multipart upload. When you upload a file in parts, all parts except the last part must be of the same size. The error message "All non-trailing parts must have the same length" indicates that this condition is not being met.
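To illustrate the constraint, here is a minimal, hypothetical sketch (not code from s3fs or boto3): chunking a payload at a fixed part size automatically satisfies the "equal non-trailing parts" rule, since only the last slice can come up short.

```python
# Illustrative only: splitting data at a fixed part size guarantees that
# every part except the last has the same length -- the property the
# multipart upload API expects from non-trailing parts.
def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

parts = split_into_parts(b"x" * 10, part_size=4)
# Part lengths are [4, 4, 2]: all non-trailing parts equal, last one shorter.
```

A client that instead flushed whatever happened to be buffered at the time could produce unequal middle parts and trigger exactly this error.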
The `persist` method in the `StorageContext` class of the LlamaIndex codebase is designed to save the current state of the storage context to disk. If a file system instance is provided, it is used to save the storage context. In your case, you're using `s3fs` as the file system instance to save the storage context to S3.
The issue might be related to the size of the file you're trying to upload or to the way `s3fs` handles the upload. It's possible that the file is not being split into equal parts for the upload, causing the error.
Unfortunately, I couldn't find a similar issue in the LlamaIndex repository. However, I would suggest trying to upload a smaller file to see if the issue persists. If the issue is indeed related to the file size, you might need to adjust the way the file is split for the upload.
Also, please ensure that you're using the latest versions of `s3fs` and `botocore`, as this might be a bug that has been fixed in a later version.
I hope this helps! If you have any other questions or if the issue persists, please let me know.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Hey team, I still need help. This happens randomly for different files, and I've already upgraded to the latest boto3, botocore, s3fs, and llama-index.
Hi @scoobyd0027 I can help with this. So I can troubleshoot the issue, would you provide any files that reliably cause the multi part upload exception? It seems like the file.io links you previously provided have expired (it says the files have been deleted).
Hey @mattmcgiv, thanks for reaching out, here is the file which is constantly failing https://drive.google.com/file/d/1lrGL6Aj_f6Tu_3GTDrBWvSw9e0BgXHQF/view?usp=sharing
Thanks @scoobyd0027 ... taking a look.
It seems that R2 is the issue here. R2 must not be fully S3-compatible: if I swap an Amazon S3 bucket in place of the R2 bucket in your sample code, the upload works fine. The `[Errno 22] All non-trailing parts must have the same length.` error seems to be coming from the R2 API; however, based on the S3 documentation, I can find no such constraint on multipart uploads. I'd recommend switching your infrastructure from Cloudflare's R2 to Amazon S3 to get your use case working, @scoobyd0027.
For the llama-index documentation [1], would we be in favor of making that change as well (using S3 instead of R2)? If so, I'm happy to take on the PR.
[1] https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/storage/save_load.html
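If the root cause is that the client flushes its write buffer in unequal chunks, a client-side fix would have to buffer writes and emit only fixed-size parts. A minimal, hypothetical sketch of that buffering (illustrative only; `FixedPartBuffer` is not a real s3fs or boto3 class):

```python
# Hypothetical sketch: accumulate arbitrary writes and release only
# equal-size parts, keeping the remainder for the final trailing part --
# the behaviour a backend like R2 appears to require.
class FixedPartBuffer:
    def __init__(self, part_size: int):
        self.part_size = part_size
        self._pending = b""

    def write(self, data: bytes) -> list[bytes]:
        """Buffer incoming bytes; return any complete, equal-size parts."""
        self._pending += data
        parts = []
        while len(self._pending) >= self.part_size:
            parts.append(self._pending[:self.part_size])
            self._pending = self._pending[self.part_size:]
        return parts

    def close(self) -> bytes:
        """Return the final (possibly shorter) trailing part."""
        trailing, self._pending = self._pending, b""
        return trailing
```

With this scheme, no matter how irregular the caller's writes are, every uploaded part except the last has exactly `part_size` bytes.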
Hi, @scoobyd0027,
I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, you encountered an error related to persisting specific files to S3 storage using the s3fs library, and the community provided relevant code, version information, and traceback logs. After detailed discussions, it was identified that the issue was related to the R2 infrastructure not being fully S3-compatible. The recommendation was to switch to Amazon S3, with a proposal to update the llama-index documentation accordingly.
Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your understanding and contributions to the LlamaIndex project.
— Dosu
Bug Description
This issue happens only with specific files. Following this tutorial https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/storage/save_load.html and using the same code, I am trying to upload the index of this file to S3: https://file.io/fWYseeDdhaTD.
s3fs - 2023.6.0
Version
v0.8.0
Steps to Reproduce
Here is the sample code (note: I am using Cloudflare R2). This file can be used to index and upload: https://file.io/fWYseeDdhaTD.
```python
import os
import s3fs
from llama_index.indices.vector_store.base import VectorStoreIndex
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader(input_files=["biochemistry.pdf"]).load_data()
index = VectorStoreIndex.from_documents(documents)

AWS_KEY = os.environ['AWS_ACCESS_KEY_ID']
AWS_SECRET = os.environ['AWS_SECRET_ACCESS_KEY']
R2_ACCOUNT_ID = os.environ['R2_ACCOUNT_ID']

assert AWS_KEY is not None and AWS_KEY != ""

s3 = s3fs.S3FileSystem(
    key=AWS_KEY,
    secret=AWS_SECRET,
    endpoint_url=f'https://{R2_ACCOUNT_ID}.r2.cloudflarestorage.com',
    s3_additional_kwargs={'ACL': 'public-read'},
)

index.set_index_id("vector_index")
index.storage_context.persist('llama-index/storage_demo', fs=s3)
```
Relevant Logs/Tracebacks