langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

DirectoryLoader use_multithreading inconsistent behavior between true and false (and issue with UnstructuredFileLoader and .json files) #15731

Closed cgalo5758 closed 1 month ago

cgalo5758 commented 8 months ago

System Info

requirements.txt - aiohttp==3.9.1 - aiosignal==1.3.1 - annotated-types==0.6.0 - anyio==4.2.0 - argon2-cffi==23.1.0 - argon2-cffi-bindings==21.2.0 - arrow==1.3.0 - asgiref==3.7.2 - asttokens==2.4.1 - async-lru==2.0.4 - attrs==23.2.0 - Babel==2.14.0 - backoff==2.2.1 - bcrypt==4.1.2 - beautifulsoup4==4.12.2 - bleach==6.1.0 - build==1.0.3 - cachetools==5.3.2 - certifi==2023.11.17 - cffi==1.16.0 - chardet==5.2.0 - charset-normalizer==3.3.2 - chroma-hnswlib==0.7.3 - chromadb==0.4.22 - click==8.1.7 - coloredlogs==15.0.1 - comm==0.2.1 - dataclasses-json==0.6.3 - debugpy==1.8.0 - decorator==5.1.1 - defusedxml==0.7.1 - Deprecated==1.2.14 - distro==1.9.0 - docarray==0.40.0 - emoji==2.9.0 - executing==2.0.1 - fastapi==0.108.0 - fastjsonschema==2.19.1 - filelock==3.13.1 - filetype==1.2.0 - flatbuffers==23.5.26 - fqdn==1.5.1 - frozenlist==1.4.1 - fsspec==2023.12.2 - gitdb==4.0.11 - GitPython==3.1.40 - google-auth==2.26.1 - googleapis-common-protos==1.62.0 - greenlet==3.0.3 - grpcio==1.60.0 - h11==0.14.0 - httpcore==1.0.2 - httptools==0.6.1 - httpx==0.26.0 - huggingface-hub==0.20.2 - humanfriendly==10.0 - idna==3.6 - importlib-metadata==6.11.0 - importlib-resources==6.1.1 - ipykernel==6.28.0 - ipython==8.19.0 - isoduration==20.11.0 - jedi==0.19.1 - Jinja2==3.1.2 - joblib==1.3.2 - json5==0.9.14 - jsonpatch==1.33 - jsonpath-python==1.0.6 - jsonpointer==2.4 - jsonschema==4.20.0 - jsonschema-specifications==2023.12.1 - jupyter-events==0.9.0 - jupyter-lsp==2.2.1 - jupyter_client==8.6.0 - jupyter_core==5.7.0 - jupyter_server==2.12.2 - jupyter_server_terminals==0.5.1 - jupyterlab==4.0.10 - jupyterlab_pygments==0.3.0 - jupyterlab_server==2.25.2 - kubernetes==28.1.0 - langchain==0.1.0 - langchain-community==0.0.9 - langchain-core==0.1.7 - langchain-openai==0.0.2 - langdetect==1.0.9 - langsmith==0.0.77 - lxml==5.1.0 - Markdown==3.5.1 - markdown-it-py==3.0.0 - MarkupSafe==2.1.3 - marshmallow==3.20.1 - matplotlib-inline==0.1.6 - mdurl==0.1.2 - mistune==3.0.2 - mmh3==4.0.1 - monotonic==1.6 - 
mpmath==1.3.0 - multidict==6.0.4 - mypy-extensions==1.0.0 - nbclient==0.9.0 - nbconvert==7.14.0 - nbformat==5.9.2 - nest-asyncio==1.5.8 - nltk==3.8.1 - notebook_shim==0.2.3 - numpy==1.26.3 - oauthlib==3.2.2 - onnxruntime==1.16.3 - openai==1.6.1 - opentelemetry-api==1.22.0 - opentelemetry-exporter-otlp-proto-common==1.22.0 - opentelemetry-exporter-otlp-proto-grpc==1.22.0 - opentelemetry-instrumentation==0.43b0 - opentelemetry-instrumentation-asgi==0.43b0 - opentelemetry-instrumentation-fastapi==0.43b0 - opentelemetry-proto==1.22.0 - opentelemetry-sdk==1.22.0 - opentelemetry-semantic-conventions==0.43b0 - opentelemetry-util-http==0.43b0 - orjson==3.9.10 - overrides==7.4.0 - packaging==23.2 - pandocfilters==1.5.0 - parso==0.8.3 - pexpect==4.9.0 - platformdirs==4.1.0 - posthog==3.1.0 - prometheus-client==0.19.0 - prompt-toolkit==3.0.43 - protobuf==4.25.1 - psutil==5.9.7 - ptyprocess==0.7.0 - pulsar-client==3.4.0 - pure-eval==0.2.2 - pyasn1==0.5.1 - pyasn1-modules==0.3.0 - pycparser==2.21 - pydantic==2.5.3 - pydantic_core==2.14.6 - Pygments==2.17.2 - PyPika==0.48.9 - pyproject_hooks==1.0.0 - python-dateutil==2.8.2 - python-dotenv==1.0.0 - python-iso639==2024.1.2 - python-json-logger==2.0.7 - python-magic==0.4.27 - PyYAML==6.0.1 - pyzmq==25.1.2 - rapidfuzz==3.6.1 - referencing==0.32.1 - regex==2023.12.25 - requests==2.31.0 - requests-oauthlib==1.3.1 - rfc3339-validator==0.1.4 - rfc3986-validator==0.1.1 - rich==13.7.0 - rpds-py==0.16.2 - rsa==4.9 - Send2Trash==1.8.2 - six==1.16.0 - smmap==5.0.1 - sniffio==1.3.0 - soupsieve==2.5 - SQLAlchemy==2.0.25 - stack-data==0.6.3 - starlette==0.32.0.post1 - sympy==1.12 - tabulate==0.9.0 - tenacity==8.2.3 - terminado==0.18.0 - tiktoken==0.5.2 - tinycss2==1.2.1 - tokenizers==0.15.0 - tornado==6.4 - tqdm==4.66.1 - traitlets==5.14.1 - typer==0.9.0 - types-python-dateutil==2.8.19.20240106 - types-requests==2.31.0.20240106 - typing-inspect==0.9.0 - typing_extensions==4.9.0 - unstructured==0.11.8 - unstructured-client==0.15.2 - 
uri-template==1.3.0 - urllib3==1.26.18 - uvicorn==0.25.0 - uvloop==0.19.0 - watchfiles==0.21.0 - wcwidth==0.2.13 - webcolors==1.13 - webencodings==0.5.1 - websocket-client==1.7.0 - websockets==12.0 - wrapt==1.16.0 - yarl==1.9.4 - zipp==3.17.0

Who can help?

@ey

Information

Related Components

Reproduction

Code sample to reproduce, where my-codebase is a directory containing a heterogeneous collection of files (.tsx, .json, .ts, .js, .md):

# Document loading: Load codebase from local directory
from langchain_community.document_loaders import DirectoryLoader

project_path = "my-codebase"

loader = DirectoryLoader(project_path, use_multithreading=False)

my_codebase_data = loader.load()

This creates the following error:

{
    "name": "ValueError",
    "message": "Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.",
    "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 8
      4 project_path = \"my-codebase\"
      6 loader = DirectoryLoader(project_path, use_multithreading=False)
----> 8 my_codebase_data = loader.load()

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/langchain_community/document_loaders/directory.py:157, in DirectoryLoader.load(self)
    155 else:
    156     for i in items:
--> 157         self.load_file(i, p, docs, pbar)
    159 if pbar:
    160     pbar.close()

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/langchain_community/document_loaders/directory.py:106, in DirectoryLoader.load_file(self, item, path, docs, pbar)
    104         logger.warning(f\"Error loading file {str(item)}: {e}\")
    105     else:
--> 106         raise e
    107 finally:
    108     if pbar:

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/langchain_community/document_loaders/directory.py:100, in DirectoryLoader.load_file(self, item, path, docs, pbar)
     98 try:
     99     logger.debug(f\"Processing file: {str(item)}\")
--> 100     sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
    101     docs.extend(sub_docs)
    102 except Exception as e:

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/langchain_community/document_loaders/unstructured.py:87, in UnstructuredBaseLoader.load(self)
     85 def load(self) -> List[Document]:
     86     \"\"\"Load file.\"\"\"
---> 87     elements = self._get_elements()
     88     self._post_process_elements(elements)
     89     if self.mode == \"elements\":

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/langchain_community/document_loaders/unstructured.py:173, in UnstructuredFileLoader._get_elements(self)
    170 def _get_elements(self) -> List:
    171     from unstructured.partition.auto import partition
--> 173     return partition(filename=self.file_path, **self.unstructured_kwargs)

File ~/Repos/chain-repo/.venv/lib64/python3.11/site-packages/unstructured/partition/auto.py:480, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, pdf_extract_images, pdf_extract_element_types, pdf_image_output_dir_path, pdf_extract_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, **kwargs)
    478 elif filetype == FileType.JSON:
    479     if not is_json_processable(filename=filename, file=file):
--> 480         raise ValueError(
    481             \"Detected a JSON file that does not conform to the Unstructured schema. \"
    482             \"partition_json currently only processes serialized Unstructured output.\",
    483         )
    484     elements = partition_json(filename=filename, file=file, **kwargs)
    485 elif (filetype == FileType.XLSX) or (filetype == FileType.XLS):

ValueError: Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output."
}

Expected behavior

To get the expected behavior, set use_multithreading to True:

loader = DirectoryLoader(project_path, use_multithreading=True)

Doing this loads the files without error.

Curiously, I get the same result if I just set silent_errors to True:

loader = DirectoryLoader(project_path, use_multithreading=False, silent_errors=True)

In this case, the error is printed, but the execution is not halted.

Curiously, if I set both use_multithreading and silent_errors to True, I get the same behaviour as for use_multithreading=False: this time it acknowledges that there are errors, whereas with use_multithreading=True alone it silently ignores them and doesn't even print them.

loader = DirectoryLoader(project_path, use_multithreading=True, silent_errors=True)

Additional thoughts

dosubot[bot] commented 8 months ago

🤖

Hi @cgalo5758! I'm Dosu, a bot here to assist you with debugging, answer your queries, and guide you to contribute efficiently to the LangChain repository. While we wait for a human maintainer to join us, I'm at your service to help with any issues. Let's get this sorted out together, shall we?

Thank you for providing the detailed information. The issue you're experiencing seems to be related to how the DirectoryLoader handles exceptions and file loading when the use_multithreading parameter is set to False.

When use_multithreading is set to True, load_file is submitted to a ThreadPoolExecutor through executor.map. The iterator that map returns is never consumed, so an exception raised inside a worker thread is never re-raised in the calling thread and the failure passes silently. When use_multithreading is set to False, load_file is called directly in a for loop. As the traceback shows, load_file catches the exception and re-raises it unless silent_errors is True, in which case it only logs a warning. With the default silent_errors=False, the ValueError therefore propagates out of load() and terminates the program.

Here is the relevant code:

if self.use_multithreading:
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=self.max_concurrency
    ) as executor:
        executor.map(lambda i: self.load_file(i, p, docs, pbar), items)
else:
    for i in items:
        self.load_file(i, p, docs, pbar)
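The difference between the two branches can be reproduced with the standard library alone. The following is a minimal sketch, not LangChain's code: load_file here is a hypothetical stand-in that fails on .json names, and the item list is made up for illustration. It shows that discarding the iterator returned by executor.map hides the exception, while consuming it, or using a plain loop, surfaces it.

```python
import concurrent.futures


def load_file(item: str) -> None:
    """Hypothetical stand-in for a loader that fails on .json files."""
    if item.endswith(".json"):
        raise ValueError(f"cannot parse {item}")


items = ["a.md", "b.json", "c.ts"]

# Case 1: the iterator returned by map() is discarded, so the ValueError
# stays stored on the worker's future and is never re-raised here.
discarded_raised = False
try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.map(load_file, items)
except ValueError:
    discarded_raised = True

# Case 2: consuming the iterator re-raises the stored exception.
consumed_raised = False
try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        list(executor.map(load_file, items))
except ValueError:
    consumed_raised = True

# Case 3: a plain loop, as in the single-threaded branch, raises immediately.
loop_raised = False
try:
    for item in items:
        load_file(item)
except ValueError:
    loop_raised = True

print(discarded_raised, consumed_raised, loop_raised)
```

If the multithreaded branch is meant to behave like the single-threaded one, consuming the map iterator (as in case 2) is one way to make worker exceptions visible; whether that is the right fix for DirectoryLoader is a judgment call for the maintainers.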

As for the recursive parameter, the DirectoryLoader goes through each subdirectory even when recursive is set to False because of the default glob pattern. The pattern "**/[!.]*" matches every non-hidden entry in the directory and, because of the leading **, in all of its subdirectories. The pattern is passed to rglob when recursive is True and to glob when recursive is False, so both calls return the matching files from the directory and its subdirectories.

Here is the relevant code:

items = list(p.rglob(self.glob) if self.recursive else p.glob(self.glob))
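The effect of the pattern can be checked with pathlib on a throwaway directory. This is a small sketch that assumes the default pattern is "**/[!.]*"; the file names are invented for the example, and the is_file() filter is added here only to keep directories out of the comparison.

```python
import tempfile
from pathlib import Path

PATTERN = "**/[!.]*"  # assumed default glob of DirectoryLoader


def files(root: Path, paths) -> list:
    """Return sorted POSIX-style relative paths of the regular files in `paths`."""
    return sorted({p.relative_to(root).as_posix() for p in paths if p.is_file()})


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "top.md").write_text("top")
    (root / "sub" / "nested.md").write_text("nested")
    (root / ".hidden").write_text("hidden")

    # The leading ** makes glob() recurse, so it finds the same files as rglob().
    via_glob = files(root, root.glob(PATTERN))
    via_rglob = files(root, root.rglob(PATTERN))

    # A pattern without ** stays in the top-level directory.
    top_only = files(root, root.glob("[!.]*"))

print(via_glob, via_rglob, top_only)
```

Under that assumption, passing a non-recursive pattern such as "[!.]*" as the glob argument is one way to restrict loading to the top-level directory.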

Regarding the handling of different file types, the DirectoryLoader class in LangChain does not itself treat file types differently. It uses the loader_cls parameter to determine how to load each file. By default, this is set to UnstructuredFileLoader, which hands every file to the partition function of the unstructured library. As the traceback shows, partition detects the file type and raises a ValueError for JSON files that do not conform to the Unstructured schema.

If you want to handle different file types differently, you would need to create different loader classes (like TextLoader or BSHTMLLoader) and pass them as the loader_cls parameter when creating a DirectoryLoader instance.
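As a rough sketch of that suggestion, the snippet below builds one DirectoryLoader per glob and reads every matched file as plain text with TextLoader, which avoids the Unstructured JSON check. The TEXT_GLOBS list and the build_loaders helper are hypothetical names introduced for this example, and it assumes that treating the codebase files as raw text is acceptable.

```python
from typing import List

# Hypothetical list for illustration: one glob per extension from the report.
TEXT_GLOBS: List[str] = ["**/*.json", "**/*.ts", "**/*.tsx", "**/*.js", "**/*.md"]


def build_loaders(project_path: str) -> list:
    """Return one DirectoryLoader per glob, each reading files as plain text."""
    # Imported lazily so this sketch can be imported without LangChain installed.
    from langchain_community.document_loaders import DirectoryLoader, TextLoader

    return [
        DirectoryLoader(project_path, glob=pattern, loader_cls=TextLoader)
        for pattern in TEXT_GLOBS
    ]
```

With LangChain installed, the documents could then be collected with something like docs = [d for loader in build_loaders("my-codebase") for d in loader.load()].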

I hope this helps! If you have any other questions, feel free to ask.

Best, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

cgalo5758 commented 4 months ago

This is still a bug. Even if it is small.