langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output while using langchain S3DirectoryLoader #24588

Open nadeems2024 opened 2 months ago

nadeems2024 commented 2 months ago


Example Code

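import logging

from langchain_community.document_loaders import S3DirectoryLoader

# s3_bucket_name and s3_prefix are defined elsewhere (values are given in the Description).
loader = S3DirectoryLoader(bucket=s3_bucket_name, prefix=s3_prefix)
try:
    documents = loader.load()
    logging.info(f"size of the loaded documents {len(documents)}")
except Exception as e:
    logging.info(f"error loading documents: {e}")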

Error Message and Stack Trace (if applicable)

Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.

    doc = loader.load()
          ^^^^^^^^^^^^^
  File "/prj/.venv/lib/python3.12/site-packages/langchain_community/document_loaders/s3_directory.py", line 139, in load
    docs.extend(loader.load())
                ^^^^^^^^^^^^^
  File "/prj/.venv/lib/python3.12/site-packages/langchain_core/document_loaders/base.py", line 30, in load
    return list(self.lazy_load())
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/prj/.venv/lib/python3.12/site-packages/langchain_community/document_loaders/unstructured.py", line 89, in lazy_load
    elements = self._get_elements()
               ^^^^^^^^^^^^^^^^^^^^
  File "/prj/.venv/lib/python3.12/site-packages/langchain_community/document_loaders/s3_file.py", line 135, in _get_elements
    return partition(filename=file_path, **self.unstructured_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/prj/.venv/lib/python3.12/site-packages/unstructured/partition/auto.py", line 389, in partition
    raise ValueError(
ValueError: Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.
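
For context: the traceback ends in unstructured's auto partition function, which rejects a file it detects as JSON unless that file is serialized Unstructured output. The sketch below is only a rough approximation of that kind of check, not the library's actual code. It assumes that serialized output is a top-level JSON array of objects, and the function name looks_like_unstructured_output is hypothetical.

```python
import json


def looks_like_unstructured_output(text: str) -> bool:
    """Rough approximation (an assumption, not unstructured's real check):
    accept a payload only if it is a JSON array whose items are all objects."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, list) and all(
        isinstance(item, dict) for item in payload
    )


# A top-level object, shaped like the file in this report, fails the check.
plain_object = '{"abc": "This is a text json file"}'
# A list of element-like objects passes this rough check.
element_list = '[{"type": "NarrativeText", "text": "This is a text json file"}]'

print(looks_like_unstructured_output(plain_object))
print(looks_like_unstructured_output(element_list))
```

Under this assumption, any ordinary JSON document whose top level is an object would be rejected in the same way as the file reported here.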

Description

My S3 bucket has a single folder that contains JSON files.
Bucket name: "abc-bc-name"
Prefix: "output"

Each file contains plain JSON, for example:

{
  "abc": "This is a text json file",
  "source": "https://asf.test/4865422_f4866011606d84f50d10e60e0b513b7",
  "correlation_id": "4865422_f4866011606d84f50d10e60e0b513b7"
}
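
One possible workaround, sketched here under stated assumptions, is to skip unstructured's partitioning for plain JSON files and build the document text directly. SimpleDocument and json_bytes_to_document are hypothetical names used only for illustration. In a real pipeline the bytes would come from S3 and the result would be wrapped in a LangChain Document instead. Using "abc" as the text field is an assumption based on the sample file.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def json_bytes_to_document(raw: bytes, text_key: str = "abc") -> SimpleDocument:
    """Parse one plain JSON object and split it into text and metadata.

    text_key is an assumption taken from the sample file; every other
    top-level key is kept as metadata.
    """
    record = json.loads(raw.decode("utf-8"))
    if not isinstance(record, dict):
        raise ValueError("expected a top-level JSON object")
    text = str(record.get(text_key, ""))
    metadata = {k: v for k, v in record.items() if k != text_key}
    return SimpleDocument(page_content=text, metadata=metadata)


# Sample payload copied from the file content in this report.
sample = b"""{
  "abc": "This is a text json file",
  "source": "https://asf.test/4865422_f4866011606d84f50d10e60e0b513b7",
  "correlation_id": "4865422_f4866011606d84f50d10e60e0b513b7"
}"""

doc = json_bytes_to_document(sample)
print(doc.page_content)
print(sorted(doc.metadata))
```

Keeping the remaining keys as metadata preserves "source" and "correlation_id" for later lookups. Whether that split suits a given dataset is a judgment call.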

System Info

langchain==0.2.10
langchain-cli==0.0.25
langchain-community==0.2.9
langchain-core==0.2.22
langchain-openai==0.1.17
langchain-text-splitters==0.2.2

macOS
Python 3.12.0

slliao445 commented 2 months ago

I uploaded a txt file and got this error too.

nadeems2024 commented 1 month ago

Bump!

daniux commented 1 month ago

Facing the same issue here as well. Using "langchain_community.document_loaders.S3DirectoryLoader".