run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License
3.42k stars 728 forks

[Bug]: SharePointReader fails to load file directory #901

Open jamiesun opened 5 months ago

jamiesun commented 5 months ago

Bug Description


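import os
from llama_index import download_loader

# SharePointLoader is the reader class fetched from LlamaHub
SharePointLoader = download_loader("SharePointReader")

loader = SharePointLoader(
    client_id=os.environ.get("TEST_APP_CLIENT_ID"),
    client_secret=os.environ.get("TEST_APP_CLIENT_SECRET"),
    tenant_id=os.environ.get("TEST_TENENT_ID"),
)

documents = loader.load_data(
    sharepoint_site_name="GPT",
    sharepoint_folder_path="Python",
    recursive=True,
)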

An error occurred while accessing SharePoint: {'code': 'itemNotFound', 'message': 'The resource could not be found.'}

Version

main

Steps to Reproduce

sharepoint config

[image: SharePoint configuration screenshot]

Relevant Logs/Tracebacks

No response

arun04cbe commented 5 months ago

@jamiesun could you check if the permissions are set as mentioned here - https://llamahub.ai/l/microsoft_sharepoint?from=loaders

brandon-vidoori commented 5 months ago

Same issue. Permissions are set up correctly in Azure/SharePoint.

arun04cbe commented 5 months ago

@jamiesun or @brandon-vidoori could you confirm whether you were trying to access only folders present in the Documents component of the SharePoint site, and not other components like Pages or Site Contents?

brandon-vidoori commented 5 months ago

Yes, I am only trying to access folders/documents present in the Documents folder. It seems to fail on the Graph search for the SharePoint site; specifically, the query for the site name returns nothing.

https://github.com/run-llama/llama-hub/blob/01400bf31ed336137e36caed6809e48bad1c3621/llama_hub/microsoft_sharepoint/base.py#L91

rupache commented 5 months ago

I am also getting the same error: An error occurred while accessing SharePoint: {'code': 'itemNotFound', 'message': 'The resource could not be found.'}

for doc in documents: TypeError: 'NoneType' object is not iterable
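The TypeError above is what Python raises when iterating over None, which suggests load_data returned None after the SharePoint error was reported. A minimal sketch of a guard, assuming that behaviour; require_documents is a hypothetical helper and not part of the loader:

```python
def require_documents(documents):
    """Return documents unchanged, or fail loudly if the loader gave back None.

    Hypothetical helper: assumes load_data returns None after it reports
    a SharePoint error, as the TypeError in this thread suggests.
    """
    if documents is None:
        raise RuntimeError(
            "load_data returned None; check the SharePoint error reported earlier"
        )
    return documents
```

Wrapping the call as `for doc in require_documents(documents):` would surface the underlying failure instead of the less helpful 'NoneType' message.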

jamiesun commented 5 months ago

@arun04cbe @brandon-vidoori

I switched to a different site without changing any code, and the code executed successfully; I then tried other sites, and those worked too.

It feels a little strange. The failing site's name is GPT, and I don't know whether the name has something to do with it. When I created that site I used a mixture of English and Chinese, and the system automatically generated GPT as the site name.

The sites that I succeeded in executing were all single English word site names without exception.

I'm not sure if it's a problem with SharePoint itself.

brandon-vidoori commented 5 months ago

@jamiesun @arun04cbe

The sites that I succeeded in executing were all single English word site names without exception.

The site I have been testing with has a name like “Data Science”, so that might be causing the issue. I will try a site named “Data” to see if that succeeds.

The documentation for the Graph REST API site search is not clear on the expected behavior for a partial match, and would seem to suggest that a search for “Data” returns sites named “Data Science” and “Data Management”. GetSite seems more appropriate when an exact match is required, so the inflexibility we are noticing is bizarre, to say the least.

arun04cbe commented 5 months ago

I faced this issue too. I am reading through the Microsoft documentation for a fix. The current loader is application-based only, but we also need a user-based loader, which I am working on. I will post updates here once it is fixed.

rupache commented 5 months ago

Can we store indexes in the SharePoint document library itself for persistence? That way the data will be secure within the same domain.

brandon-vidoori commented 5 months ago

@jamiesun @arun04cbe

SharePointReader returned the same error with a single-word English site name like "Data" from my previous example.

I know I have set up permissions correctly because I can debug SharePointReader locally, set a breakpoint, and step through the code until the access_token is generated, then use that same access_token in Postman with GET https://graph.microsoft.com/v1.0/sites?search=Data and it succeeds. Not only do I get the Data site, but I also get the DataScience site that I previously created.
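For anyone who wants to repeat that check outside Postman, here is a rough sketch using only the Python standard library. It assumes you already have a valid access_token; search_sites and pick_exact_match are hypothetical helper names, and matching on "displayName" or "name" is an assumption about how one might pick a single site out of a partial-match result, not how the loader does it.

```python
import json
import urllib.parse
import urllib.request

GRAPH_SITES_URL = "https://graph.microsoft.com/v1.0/sites"


def build_site_search_url(site_name):
    """Build the same kind of URL as the Postman request in this thread."""
    return GRAPH_SITES_URL + "?" + urllib.parse.urlencode({"search": site_name})


def search_sites(access_token, site_name):
    """Call the Graph sites search and return the list under "value"."""
    request = urllib.request.Request(
        build_site_search_url(site_name),
        headers={"Authorization": "Bearer " + access_token},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload.get("value", [])


def pick_exact_match(sites, site_name):
    """Hypothetical helper: return the first site whose name matches exactly."""
    for site in sites:
        if site.get("displayName") == site_name or site.get("name") == site_name:
            return site
    return None
```

Calling `pick_exact_match(search_sites(token, "Data"), "Data")` would then separate the Data site from other partial matches such as DataScience.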

arun04cbe commented 5 months ago

@brandon-vidoori Thanks for pointing out the exact problem. Will look to fix this up.

brandon-vidoori commented 5 months ago

@arun04cbe

I eventually got it working.

This seems less like a bug and more like a case where the documentation on the config variables could be clearer. Provided permissions are set up correctly, consider the following:

sharepoint_site_name - just the name of the site, like “Data Science” or “Data”.

sharepoint_folder_path - just the name of a top-level folder in Documents, like “Tests”. If you pass “Documents/Tests” or “/Tests” it will fail. Use only the folder name. Note: I only tested with recursive = True.
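The folder rule above could be checked before calling the loader. This is only a sketch based on the behaviour reported in this thread, not on the loader's documented contract; check_folder_name is a hypothetical helper.

```python
def check_folder_name(sharepoint_folder_path):
    """Reject values this thread reports as failing, such as "Documents/Tests".

    Hypothetical pre-check: accepts only a bare top-level folder name.
    """
    name = sharepoint_folder_path.strip()
    if not name or "/" in name:
        raise ValueError(
            "Pass only the top-level folder name, e.g. 'Tests', "
            "not %r" % sharepoint_folder_path
        )
    return name
```

For example, `check_folder_name("Tests")` returns the name unchanged, while "Documents/Tests" or "/Tests" raises a ValueError before any request is made.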