vgrem / Office365-REST-Python-Client

Microsoft 365 & Microsoft Graph Library for Python
MIT License

How can I download SharePoint folder containing multiple files? #94

Closed AakashBasu closed 5 years ago

AakashBasu commented 5 years ago

My Python 3 code:

    from office365.runtime.auth.authentication_context import AuthenticationContext
    from office365.sharepoint.client_context import ClientContext

    url = 'https://company.sharepoint.com/sites/abc'
    ctx_auth = AuthenticationContext(url=url)
    if ctx_auth.acquire_token_for_user(username='abcd.xyz@company.com', password='12345'):
        ctx = ClientContext(url, ctx_auth)
        lists = ctx.web.lists
        ctx.load(lists)
        ctx.execute_query()
        for l in lists:
            print(l.properties['Title'])

With the above code, I can list the titles of the lists in the site. My plan, however, is to run this entire module in AWS Lambda using Python, download files from the SharePoint Documents library, and store them in AWS S3.

A folder can have multiple files, and I want to download the entire folder with all its files. Has anyone done this? Working code would be a great help, as I am totally new to web scraping!

Bachatero commented 5 years ago

Hi, perhaps you could do it in a loop, e.g.:

  1. First, return the contents of the SharePoint Documents library using a function:

    import sys

    listTitle = "Documents"
    site = "abc"

    def fncPrintLibraryContents(ctx, listTitle):
        try:
            # Load the library's root folder, then the files directly inside it
            list_object = ctx.web.lists.get_by_title(listTitle)
            folder = list_object.root_folder
            ctx.load(folder)
            ctx.execute_query()

            files = folder.files
            ctx.load(files)
            ctx.execute_query()

            return files
        except Exception:
            print('Problem printing out library contents')
            sys.exit(1)

  2. Then download each file by calling a procedure, e.g.:

    # Import path as used elsewhere in this thread
    from office365.sharepoint.file import File

    def downloadFile(ctx, fileName):
        try:
            # Fetch the file first, so a failed request does not leave an empty local file
            relativeUrl = '/sites/{0}/Shared%20Documents/{1}'.format(site, fileName)
            response = File.open_binary(ctx, relativeUrl)
            with open(fileName, "wb") as localFile:
                localFile.write(response.content)
        except Exception:
            print('Problem downloading file:', fileName)
            sys.exit(1)

    myfiles = fncPrintLibraryContents(ctx, listTitle)

    for myfile in myfiles:
        print("Downloading file: {0}".format(myfile.properties["Name"]))
        downloadFile(ctx, myfile.properties["Name"])

m.

Bachatero commented 5 years ago

Please indent the last two lines so they sit inside the for loop; I can't seem to get the formatting right here. m.

AakashBasu commented 5 years ago

Hey,

Thanks for such a quick reply. I can download files successfully, provided I give the full path down to the file name. But to download all the files recursively, I first need to list the existing ones in a particular folder, and after several attempts I keep getting Not Found errors. Maybe I am going wrong somewhere because my understanding of Title is not right: whenever I try to list a subfolder by giving its name as the title, it fails. I will go through your code and see whether I can get it working.

Meanwhile, here is my current code. Downloading works fine, and listing folders and files for the root works, but whenever I pass any folder name other than Documents as the title, it fails:

    from office365.runtime.auth.authentication_context import AuthenticationContext
    from office365.sharepoint.client_context import ClientContext
    from office365.sharepoint.file import File
    from office365.sharepoint.file_creation_information import FileCreationInformation

    def read_folder_and_files(context, list_title):
        """Read a folder example"""
        list_obj = context.web.lists.get_by_title(list_title)
        folder = list_obj.root_folder
        context.load(folder)
        context.execute_query()
        print("List url: {0}".format(folder.properties["ServerRelativeUrl"]))

        files = folder.files
        context.load(files)
        context.execute_query()
        for cur_file in files:
            print("File name: {0}".format(cur_file.properties["Name"]))

        folders = context.web.folders
        context.load(folders)
        context.execute_query()
        for folder in folders:
            print("Folder name: {0}".format(folder.properties["Name"]))

    def download_file(context):
        response = File.open_binary(context, "/sites/new/Shared Documents/2011-A/file1.csv")
        print(response)
        print(response.content)
        with open(r"C:\Users\aakashb\Downloads\test\file1.csv", "wb") as local_file:
            local_file.write(response.content)

    ctx = None
    url = 'https://company.sharepoint.com/sites/new'
    ctx_auth = AuthenticationContext(url=url)
    if ctx_auth.acquire_token_for_user(username='name.surname@company.com', password='12345'):
        ctx = ClientContext(url, ctx_auth)
        read_folder_and_files(ctx, 'Documents')

        print('entering function')

        download_file(ctx)

        print('exiting function')

AakashBasu commented 5 years ago

1) Sorry for the broken structure of the code I gave you.
2) I just ran your code, and it does exactly what my code does in terms of listing: it lists the files in the root, not inside any folder. I want to do the same for folders.
3) I also want to list the folders. When I use @vgrem's code for listing folders, it does not show me the folders of the Documents library, but folders like:

    Folder name: SitePages
    Folder name: Style Library
    Folder name: _catalogs
    Folder name: FormServerTemplates
    Folder name: _private
    Folder name: Sharing Links
    Folder name: SiteAssets
    Folder name: images
    Folder name: Shared Documents
    Folder name: Lists
    Folder name: _cts

None of these are the folders I have in the SharePoint document library.

So, in short, how can I list the document library's folders and their respective files to be downloaded?

Bachatero commented 5 years ago

Hi,
please look at the issue here: https://github.com/vgrem/Office365-REST-Python-Client/issues/91 specifically at the line that goes like this:

folder = ctx.web.get_folder_by_server_relative_url(app_settings['urlrel'])

If it won't help then I'll get back to you to provide more details. m.

Bachatero commented 5 years ago

... what I meant was using get_folder_by_server_relative_url method instead of get_by_title, e.g.

    import sys

    app_settings = {'urlrel': '/sites/abc/Shared Documents/TEST'}

    def printFolderContents(ctx, listTitle):
        try:
            # Address the folder by its server-relative URL instead of by list title
            #list_object = ctx.web.lists.get_by_title(listTitle)
            folder = ctx.web.get_folder_by_server_relative_url(app_settings['urlrel'])
            #folder = list_object.root_folder
            ctx.load(folder)
            ctx.execute_query()
            #print(folder.url)

            files = folder.files
            ctx.load(files)
            ctx.execute_query()

            for myfile in files:
                print("File name: {0}".format(myfile.properties["Name"]))
        except Exception:
            print('Problem printing out library contents')
            sys.exit(1)

Let me know if that helps ...

Bachatero commented 5 years ago

to download the files inside TEST folder within Shared Documents library you can for instance alter the above code to make it a function, such as:

    def fncGetFolderContents(ctx, listTitle):
        try:
            # Same lookup as above, but return the file collection instead of printing it
            #list_object = ctx.web.lists.get_by_title(listTitle)
            folder = ctx.web.get_folder_by_server_relative_url(app_settings['urlrel'])
            #folder = list_object.root_folder
            ctx.load(folder)
            ctx.execute_query()
            #print(folder.url)

            files = folder.files
            ctx.load(files)
            ctx.execute_query()

            #for myfile in files:
            #    print("File name: {0}".format(myfile.properties["Name"]))

            return files
        except Exception:
            print('Problem printing out library contents')
            sys.exit(1)

and alter the download function a little, e.g:

    # site and yourFolder must be defined beforehand; for the example above
    # that would be site = "abc" and yourFolder = "TEST"
    def downloadFolderFile(ctx, fileName):
        try:
            # Fetch the file first, so a failed request does not leave an empty local file
            relativeUrl = '/sites/{0}/Shared%20Documents/{1}/{2}'.format(site, yourFolder, fileName)
            #relativeUrl = app_settings['urlrel']
            response = File.open_binary(ctx, relativeUrl)
            with open(fileName, "wb") as localFile:
                localFile.write(response.content)
        except Exception:
            print('Problem downloading file:', fileName)
            sys.exit(1)

    myfiles = fncGetFolderContents(ctx, listTitle)

    for myfile in myfiles:
        print("Downloading file: {0}".format(myfile.properties["Name"]))
        downloadFolderFile(ctx, myfile.properties["Name"])

AakashBasu commented 5 years ago

Thanks a lot man! The two of you are really prompt in replies, as well as the API is absolutely awesome!

I will go through it ASAP and try to replicate it. But is there a way to list the folders? The latest code you gave works when I know the folder name. If I automate the process and a new folder is created with files in it, it won't work for the new folder, right? That's why I also wanted to list folders, just in case. Anyway, the present solution should work for my use case.

Lot of thanks to both of you. I will update here, once I run the experiment.

Bachatero commented 5 years ago

Don't thank me, @vgrem is to blame :) ... and I'm not sure, maybe there are other ways of achieving the same ....

Right, to list all the folders inside the Shared Documents document library, you may try:

    list_object = ctx.web.lists.get_by_title(listTitle)
    folder = list_object.root_folder
    ctx.load(folder)
    ctx.execute_query()

    # Subfolders of the library's root folder
    folders = folder.folders
    ctx.load(folders)
    ctx.execute_query()

    for myfolder in folders:
        print("Folder name: {0}".format(myfolder.properties["Name"]))

m.

AakashBasu commented 5 years ago

Fantastic. Iterative folder content printing and download worked!

Thank you,
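
The listing steps discussed above can be combined into a recursive walk. The following is a minimal sketch, not code from the thread: `iter_file_urls` is a hypothetical helper name, and it assumes file objects expose a `ServerRelativeUrl` property the same way the folder in the earlier example does. It only relies on the `get_folder_by_server_relative_url`, `load` and `execute_query` calls already used in this thread.

```python
def iter_file_urls(ctx, folder_url):
    """Yield the server-relative URL of every file under folder_url, recursively.

    ctx is expected to behave like the ClientContext used in this thread.
    """
    folder = ctx.web.get_folder_by_server_relative_url(folder_url)

    # Files directly inside this folder
    files = folder.files
    ctx.load(files)
    ctx.execute_query()
    for f in files:
        yield f.properties["ServerRelativeUrl"]

    # Subfolders: collect their URLs first, then recurse into each one
    subfolders = folder.folders
    ctx.load(subfolders)
    ctx.execute_query()
    sub_urls = [sub.properties["ServerRelativeUrl"] for sub in subfolders]
    for sub_url in sub_urls:
        yield from iter_file_urls(ctx, sub_url)
```

Called as `for url in iter_file_urls(ctx, '/sites/abc/Shared Documents'): print(url)`, this would also descend into newly created folders without their names being known in advance. A document library may contain system folders that you would want to skip.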

mamonovayuliya commented 3 years ago

This code downloads corrupted PDF files. They are empty, only 156 bytes. Any ideas why?

shivparashar1984 commented 3 years ago

I am also getting corrupted PDF files of only 1 KB when using the above code. Any idea why?

mamonovayuliya commented 3 years ago

> I am also getting corrupted pdf files with only 1kb filename by using above code. Any idea?

I figured it out: for me the cause was the relative URL. When I list folder contents, I don't need to add /sites/sitename/library etc.; the URL just has to be /library. But when I download the files, I need to add /sites/sitename/folder/file.

This is really weird, because I can still access and download files without adding /sites/sitename/, but the content is then corrupted. At the same time, if I add /sites/sitename/ when getting folder contents, it throws an error, and it only works if I start the relative URL with the library.

It is also odd that every single resource suggests adding /sites/sitename to the relative URL for both folder contents and file contents.
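
One plausible explanation for the tiny "corrupted" files (an assumption, not something confirmed in this thread) is that the request failed and the short error body was written to disk as if it were the file. A minimal sketch of a guard against that follows; `save_response` is a hypothetical helper, and it assumes the object returned by `File.open_binary` has `status_code` and `content` attributes, as the response used earlier in the thread appears to.

```python
def save_response(response, local_path):
    """Write response.content to local_path only if the request succeeded.

    Returns the number of bytes written; raises IOError otherwise.
    """
    if response.status_code != 200:
        # Show the start of the body: it usually names the real problem
        raise IOError("Download failed with HTTP {0}: {1!r}".format(
            response.status_code, response.content[:200]))
    with open(local_path, "wb") as local_file:
        local_file.write(response.content)
    return len(response.content)
```

With a check like this, a wrong relative URL should surface as an error message rather than as a 156-byte PDF.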

shivparashar1984 commented 3 years ago

Thanks for the suggestion. Can you share the final working code? If we want to download all the contents of a subfolder like /sites/sitename/Documents/somefolder, what would the final code be?
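
No final code was posted in reply, so the following is only a sketch of how the pieces from this thread might fit together for a single folder. `download_folder` is a hypothetical name. The download function is passed in as `open_binary` (in this thread that would be `File.open_binary`) so the sketch does not depend on a particular import path. It uses the `ServerRelativeUrl` reported for each file instead of a hand-built URL, which assumes file objects expose that property.

```python
import os

def download_folder(ctx, folder_url, local_dir, open_binary):
    """Download every file directly inside folder_url into local_dir.

    open_binary(ctx, url) must return an object with status_code and content.
    Returns the list of local paths written.
    """
    folder = ctx.web.get_folder_by_server_relative_url(folder_url)
    files = folder.files
    ctx.load(files)
    ctx.execute_query()

    os.makedirs(local_dir, exist_ok=True)
    saved = []
    for f in files:
        name = f.properties["Name"]
        # Use the URL the server reports rather than formatting one by hand
        file_url = f.properties["ServerRelativeUrl"]
        response = open_binary(ctx, file_url)
        if response.status_code != 200:
            raise IOError("HTTP {0} for {1}".format(response.status_code, file_url))
        local_path = os.path.join(local_dir, name)
        with open(local_path, "wb") as local_file:
            local_file.write(response.content)
        saved.append(local_path)
    return saved
```

A call might look like `download_folder(ctx, '/sites/sitename/Documents/somefolder', 'downloads', File.open_binary)`, with `ctx` created as in the first post.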

sudharpr commented 3 years ago

Thanks guys. This helps solve a lot of problems and issues faced while using the Sharepoint package.

Amit1234Agrawal commented 1 year ago

Hi Friends,

Do you have any idea how to download large CSV files (larger than 10 GB) in small chunks? AWS Lambda can't handle files this large.

If possible, share the code snippet as well.

Thanks in advance!
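
The thread does not show how to obtain a streamed download from this library, so the sketch below covers only the generic part: copying from any readable binary stream to any writable one in fixed-size chunks, so the whole file never sits in memory. `copy_in_chunks` is a hypothetical helper; how the source stream is produced, and whether the destination is a local file or an upload, are left as assumptions. It may be worth checking whether your version of the library offers its own chunked download method.

```python
def copy_in_chunks(source, destination, chunk_size=1024 * 1024):
    """Copy bytes from source to destination, chunk_size bytes at a time.

    source needs a read(n) method and destination a write(data) method.
    Returns the total number of bytes copied.
    """
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            # An empty read signals end of stream
            break
        destination.write(chunk)
        total += len(chunk)
    return total
```

Even with chunking, a 10 GB transfer may run into AWS Lambda's execution time and storage limits, so the destination would probably need to be something other than local disk.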