langchain-ai / langchain

SharepointLoader not working as intended despite latest merge 'propagation of document metadata from O365BaseLoader' #22663

Open radvanyimome opened 4 weeks ago

radvanyimome commented 4 weeks ago

Example Code


from langchain_community.document_loaders.sharepoint import SharePointLoader

# O365_CLIENT_ID and O365_CLIENT_SECRET are set in the environment.
# The first 'manual' authentication succeeded; loading then fails with the error below.

# Replace "<LIBRARY_ID>" with the ID of the SharePoint document library.
loader = SharePointLoader(document_library_id="<LIBRARY_ID>", recursive=True, auth_with_token=False)
documents = loader.load()

Error Message and Stack Trace (if applicable)

ValueError                                Traceback (most recent call last)
Cell In[21], line 14
     11 documents = loader.lazy_load()
     13 # Process each document
---> 14 for doc in documents:
     15     try:
     16         # Ensure MIME type is available or set a default based on file extension
     17         if 'mimetype' not in doc.metadata or not doc.metadata['mimetype']:

File ~/.local/lib/python3.11/site-packages/langchain_community/document_loaders/sharepoint.py:86, in SharePointLoader.lazy_load(self)
     84     raise ValueError("Unable to fetch root folder")
     85 for blob in self._load_from_folder(target_folder):
---> 86     for blob_part in blob_parser.lazy_parse(blob):
     87         blob_part.metadata.update(blob.metadata)
     88         yield blob_part

File ~/.local/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/generic.py:61, in MimeTypeBasedParser.lazy_parse(self, blob)
     58 mimetype = blob.mimetype
     60 if mimetype is None:
---> 61     raise ValueError(f"{blob} does not have a mimetype.")
     63 if mimetype in self.handlers:
     64     handler = self.handlers[mimetype]

ValueError: data=None mimetype=None encoding='utf-8' path=PosixPath('/tmp/tmp92nu0bdz/test_document_on_SP.docx') metadata={} does not have a mimetype.

Description

System Info

I am currently running the code in the Unstructured Docker container (downloads.unstructured.io/unstructured-io/unstructured:latest), but other Linux environments such as Ubuntu 20.04 and python:3.11-slim produced the same error. The O365 and PyMuPDF packages were also installed.

/usr/src/app $ python -m langchain_core.sys_info

System Information

OS: Linux
OS Version: #1 SMP Fri Apr 2 22:23:49 UTC 2021
Python Version: 3.11.9 (main, May 23 2024, 20:26:53) [GCC 13.2.0]

Package Information

langchain_core: 0.2.2
langchain: 0.2.1
langchain_community: 0.2.1
langsmith: 0.1.62
langchain_google_vertexai: 1.0.4
langchain_huggingface: 0.0.3
langchain_text_splitters: 0.2.0
langchain_voyageai: 0.1.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

radvanyimome commented 4 weeks ago

My hunch is that this issue is related to the change in https://github.com/langchain-ai/langchain/pull/20663: metadata about the document seems to get lost while the file is downloaded to temporary storage. I'm not sure of the root cause, and it probably needs more eyes on it. Thanks to @MacanPN for pointing this out! Any insights, or suggestions for further checks that would help narrow this down, would be greatly appreciated.
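
If the MIME type is indeed lost during the download, one possible stopgap is to derive it from the file extension, as the processing loop in the traceback above attempts. The following is a minimal sketch using only the Python standard library. `guess_mimetype` and `FALLBACK_MIMETYPES` are hypothetical names, not part of LangChain, and the fallback table is an assumption that covers only a few common document types.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Hypothetical fallback table (an assumption, not LangChain code).
# The stdlib lookup may not know these extensions on minimal systems,
# so they are listed explicitly.
FALLBACK_MIMETYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".pdf": "application/pdf",
}


def guess_mimetype(path: str) -> Optional[str]:
    """Return a MIME type based on the file extension, or None if unknown."""
    suffix = Path(path).suffix.lower()
    if suffix in FALLBACK_MIMETYPES:
        return FALLBACK_MIMETYPES[suffix]
    # Fall back to the standard library's extension lookup.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed
```

The returned value could then be set as the blob's MIME type before parsing. Whether that fully avoids the error in `SharePointLoader` is untested here.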