langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
MIT License

SharepointLoader not working as intended despite latest merge 'propagation of document metadata from O365BaseLoader' #22663

Open radvanyimome opened 4 weeks ago

radvanyimome commented 4 weeks ago

Checked other resources

Example Code

from langchain_community.document_loaders.sharepoint import SharePointLoader

# O365_CLIENT_ID and O365_CLIENT_SECRET are set in the environment.
# The first 'manual' authentication succeeded, but loading then raised the error shown below.

loader = SharePointLoader(document_library_id="<LIBRARY_ID>", recursive=True, auth_with_token=False)
documents = loader.load()

Error Message and Stack Trace (if applicable)

ValueError                                Traceback (most recent call last)
Cell In[21], line 14
     11 documents = loader.lazy_load()
     13 # Process each document
---> 14 for doc in documents:
     15     try:
     16         # Ensure MIME type is available or set a default based on file extension
     17         if 'mimetype' not in doc.metadata or not doc.metadata['mimetype']:

File ~/.local/lib/python3.11/site-packages/langchain_community/document_loaders/, in SharePointLoader.lazy_load(self)
     84     raise ValueError("Unable to fetch root folder")
     85 for blob in self._load_from_folder(target_folder):
---> 86     for blob_part in blob_parser.lazy_parse(blob):
     87         blob_part.metadata.update(blob.metadata)
     88         yield blob_part

File ~/.local/lib/python3.11/site-packages/langchain_community/document_loaders/parsers/, in MimeTypeBasedParser.lazy_parse(self, blob)
     58 mimetype = blob.mimetype
     60 if mimetype is None:
---> 61     raise ValueError(f"{blob} does not have a mimetype.")
     63 if mimetype in self.handlers:
     64     handler = self.handlers[mimetype]

ValueError: data=None mimetype=None encoding='utf-8' path=PosixPath('/tmp/tmp92nu0bdz/test_document_on_SP.docx') metadata={} does not have a mimetype.


System Info

Currently I am running the code in the Unstructured Docker container, but attempts on other Linux platforms such as Ubuntu 20.04 and python:3.11-slim were equally unsuccessful. The O365 and PyMuPDF packages were also installed.

/usr/src/app $ python -m langchain_core.sys_info

System Information

OS: Linux
OS Version: #1 SMP Fri Apr 2 22:23:49 UTC 2021
Python Version: 3.11.9 (main, May 23 2024, 20:26:53) [GCC 13.2.0]

Package Information

langchain_core: 0.2.2
langchain: 0.2.1
langchain_community: 0.2.1
langsmith: 0.1.62
langchain_google_vertexai: 1.0.4
langchain_huggingface: 0.0.3
langchain_text_splitters: 0.2.0
langchain_voyageai: 0.1.1

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langgraph
langserve

radvanyimome commented 4 weeks ago

My hunch is that this issue is related to a commit, as metadata about the document seems to get lost while the file is downloaded to temp storage. I'm not sure of the root cause, and the problem could use more eyes. Thanks to @MacanPN for pointing this out! Any insights, or suggestions for further checks that would help narrow this down, would be greatly appreciated.
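
The traceback shows a blob for a .docx file whose mimetype is None. As a diagnostic or stop-gap, the MIME type can be derived from the file extension. This is a minimal sketch using only the Python standard library; the helper name guess_mimetype and the FALLBACK_MIMETYPES table are hypothetical and not part of LangChain, and the sketch does not address why the loader's own metadata is missing.

```python
import mimetypes
from pathlib import Path
from typing import Optional

# Hypothetical fallback table for this sketch (not part of LangChain).
# It covers a few common document types explicitly, so the result does not
# depend on which MIME tables the host system provides.
FALLBACK_MIMETYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
    ".pdf": "application/pdf",
}


def guess_mimetype(path: str) -> Optional[str]:
    """Return a MIME type for `path` based on its extension, or None."""
    suffix = Path(path).suffix.lower()
    # Prefer the explicit table for extensions we know we care about.
    if suffix in FALLBACK_MIMETYPES:
        return FALLBACK_MIMETYPES[suffix]
    # Otherwise fall back to the standard-library lookup.
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed


if __name__ == "__main__":
    # Path taken from the error message above.
    print(guess_mimetype("/tmp/tmp92nu0bdz/test_document_on_SP.docx"))
```

The explicit table is checked first because the standard-library lookup also consults system MIME files where they exist, so its answer for a given extension can differ between environments. Comparing mimetypes.guess_type("x.docx") across the containers mentioned above may be a useful extra check.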