simonw / google-drive-to-sqlite

Create a SQLite database containing metadata from Google Drive
https://datasette.io/tools/google-drive-to-sqlite
Apache License 2.0
153 stars 13 forks

google-drive-to-sqlite initial release #1

Closed simonw closed 2 years ago

simonw commented 2 years ago

Basic design:

google-drive-to-sqlite files google.db FOLDER_ID

Looks for an auth.json file with credentials in the current directory - it expects either a refresh_token or an access_token of some sort.
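A minimal sketch of that lookup (the exact auth.json layout is an assumption here, not a settled format):

```python
import json
import pathlib

def load_auth(path="auth.json"):
    # Read credentials from the working directory; which keys end up in
    # the file (refresh_token vs. access_token) is still undecided.
    data = json.loads(pathlib.Path(path).read_text())
    return data.get("refresh_token"), data.get("access_token")
```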

Features for the initial release:

simonw commented 2 years ago

Authentication will be tricky. For the moment I'll go with the simplest thing possible, but I may need to build a google-drive-to-sqlite auth command just to get things up and running.

simonw commented 2 years ago

I prototyped this in a notebook - here are the important bits:

google_client_id = '99025868001-1rdn5dj68t4m60ommr5kb151p038fn6v.apps.googleusercontent.com'
# Is it safe to bake the secret into a distributed CLI app?
google_client_secret = '...'

scope = "https://www.googleapis.com/auth/drive.readonly"
auth_url = "https://accounts.google.com/o/oauth2/v2/auth?access_type=offline&client_id={}&redirect_uri=urn:ietf:wg:oauth:2.0:oob&response_type={}&scope={}".format(
    google_client_id, "code", scope
)
print(auth_url)
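The same URL can be assembled with urllib.parse.urlencode, which percent-encodes the scope and redirect URI rather than interpolating them raw - just an alternative way to build the URL above, not a change in behaviour:

```python
from urllib.parse import urlencode

google_client_id = "99025868001-1rdn5dj68t4m60ommr5kb151p038fn6v.apps.googleusercontent.com"
scope = "https://www.googleapis.com/auth/drive.readonly"

params = {
    "access_type": "offline",
    "client_id": google_client_id,
    "redirect_uri": "urn:ietf:wg:oauth:2.0:oob",
    "response_type": "code",
    "scope": scope,
}
# urlencode() handles the ':' and '/' characters in scope and redirect_uri
auth_url = "https://accounts.google.com/o/oauth2/v2/auth?" + urlencode(params)
```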

The user then clicks on that URL, grants permissions and copies a code from the resulting page.

copied_code = '4/1A...'

import httpx
response = httpx.post("https://www.googleapis.com/oauth2/v4/token", data={
    "code": copied_code,
    "client_id": google_client_id,
    "client_secret": google_client_secret,
    "redirect_uri": "urn:ietf:wg:oauth:2.0:oob",
    "grant_type": "authorization_code",
})
tokens = response.json()

# tokens now has an `"access_token"` key that can be used directly, and a `"refresh_token"`
# that can be exchanged for another access token later on
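Those tokens presumably get written back to the auth.json file the CLI looks for. A sketch of that (the file layout and function name are my own assumptions):

```python
import json
import pathlib

def save_tokens(tokens, path="auth.json"):
    # Persist just the two fields the tool cares about; the refresh_token
    # is the long-lived credential, the access_token expires quickly.
    pathlib.Path(path).write_text(json.dumps({
        "refresh_token": tokens.get("refresh_token"),
        "access_token": tokens.get("access_token"),
    }, indent=2))
```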

The trickiest thing to figure out was how to grab all of the files in a particular folder.

Turns out you need the folder ID, then you can run a search with a q parameter of "folder_id" in parents.

You have to then do that recursively for every child folder - there's no equivalent of find . in the API.

It was also hard to figure out the fields= parameter for returning specific fields. The syntax for that turned out to be:

nextPageToken, files(id,name,kind,...)
fields = [
    "kind", "id", "name", "mimeType", "starred", "trashed", "explicitlyTrashed",
    "parents", "spaces", "version", "webViewLink", "iconLink", "hasThumbnail",
    "thumbnailVersion", "viewedByMe", "createdTime", "modifiedTime", "modifiedByMe",
    "owners", "lastModifyingUser", "shared", "ownedByMe", "viewersCanCopyContent",
    "copyRequiresWriterPermission", "writersCanShare", "folderColorRgb",
    "quotaBytesUsed", "isAppAuthorized", "linkShareMetadata"
]

def files_in_folder(folder_id):
    pageToken = None
    files_url = "https://www.googleapis.com/drive/v3/files"
    params = {
        "corpora": "user",
        "q": '"{}" in parents'.format(folder_id),
        "fields": "nextPageToken, files({})".format(",".join(fields)),
    }
    while True:
        if pageToken is not None:
            params["pageToken"] = pageToken
        else:
            params.pop("pageToken", None)
        data = httpx.get(files_url, params=params, headers={
            "Authorization": "Bearer {}".format(access_token)
        }).json()
        print(params)
        if "error" in data:
            raise Exception(data)
        yield from data["files"]
        pageToken = data.get("nextPageToken", None)
        if pageToken is None:
            break

def fetch_all_children(table, folder_id, indent=''):
    print(indent, folder_id)
    # files_in_folder() is a generator, so collect its results into a list
    files = list(files_in_folder(folder_id))
    table.insert_all(files, pk="id", replace=True)
    print(indent, "  inserted {} files".format(len(files)))
    folders = [f['id'] for f in files if f['mimeType'] == 'application/vnd.google-apps.folder']
    for folder in folders:
        fetch_all_children(table, folder, indent=indent + '  ')

folder_id = "1PhZt8DOcPjIpV09agCEngN-xzViOhrJu"

import sqlite_utils

db = sqlite_utils.Database("/tmp/files.db")
fetch_all_children(db["all_files"], folder_id)
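The recursive walk is the interesting part, and its shape can be exercised without any network access by injecting a fake listing function - everything below is illustrative scaffolding, not the real client:

```python
# A tiny in-memory stand-in for the Drive files.list call, to demonstrate
# the recursive '"X" in parents' traversal - the "find ." the API lacks.
FAKE_DRIVE = {
    "root": [
        {"id": "a", "mimeType": "text/plain"},
        {"id": "sub", "mimeType": "application/vnd.google-apps.folder"},
    ],
    "sub": [
        {"id": "b", "mimeType": "text/plain"},
    ],
}

def walk_folder(folder_id, list_children):
    # Yield every file beneath folder_id, recursing into sub-folders.
    for f in list_children(folder_id):
        yield f
        if f["mimeType"] == "application/vnd.google-apps.folder":
            yield from walk_folder(f["id"], list_children)

found = [f["id"] for f in walk_folder("root", lambda fid: FAKE_DRIVE.get(fid, []))]
```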
simonw commented 2 years ago

The access tokens only last for 3600 seconds = 1 hour - so I found myself having to obtain a second access token using the refresh token like this:

access_token = httpx.post("https://www.googleapis.com/oauth2/v4/token", data={
    "grant_type": "refresh_token",
    "refresh_token": '1//06...',
    "client_id": google_client_id,
    "client_secret": google_client_secret,
}).json()["access_token"]
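Since access tokens expire after an hour, the tool will need to decide when to refresh rather than waiting for a 401. A minimal expiry check - the function name and the 60-second safety margin are my own:

```python
import time

def token_expired(issued_at, expires_in=3600, margin=60):
    # Treat the token as stale `margin` seconds before the real deadline,
    # so an in-flight request doesn't race the expiry.
    return time.time() >= issued_at + expires_in - margin
```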
simonw commented 2 years ago

Here's the recipe that worked for retrieving the binary contents of a file - the trick is the alt=media parameter:

def get_binary(file_id):
    return httpx.get(
        "https://www.googleapis.com/drive/v3/files/{}?alt=media".format(file_id),
        headers={
            "Authorization": "Bearer {}".format(access_token)
        }
    ).content
simonw commented 2 years ago

I think thumbnailLink should be in that list of fields too. I wish I had a definitive list of fields!

simonw commented 2 years ago

I'm going to have this:

google-drive-to-sqlite files google.db

Retrieve ALL files in the drive.

google-drive-to-sqlite files google.db --folder FOLDER_ID

Will fetch just the files in that folder - using ?q= with "folder_id" in parents - applied recursively to all of the sub-folders.

google-drive-to-sqlite files google.db --q search_term

Will allow advanced search terms (passed to ?q=).
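That command surface could be sketched with stdlib argparse - the eventual tool may well use a different CLI framework, this just pins down the interface described above:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="google-drive-to-sqlite")
    sub = parser.add_subparsers(dest="command", required=True)
    files = sub.add_parser("files", help="Fetch file metadata into SQLite")
    files.add_argument("database")  # e.g. google.db
    files.add_argument("--folder", help="Only files in this folder, recursively")
    files.add_argument("--q", help="Raw Drive search query passed to ?q=")
    return parser
```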

simonw commented 2 years ago

Here's the docs for ?q=: https://developers.google.com/drive/api/v3/search-files
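One detail from those docs worth encoding early: string values in q are quoted (the docs use single quotes), and embedded quotes need a backslash escape. A small helper along those lines - mine, not part of the tool yet:

```python
def in_parents(folder_id):
    # Build a files.list q expression like: 'FOLDER_ID' in parents,
    # escaping any single quotes inside the ID per the search docs.
    escaped = folder_id.replace("'", "\\'")
    return "'{}' in parents".format(escaped)
```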

simonw commented 2 years ago

Might add other options that directly reflect the API: https://developers.google.com/drive/api/v3/reference/files/list

simonw commented 2 years ago

I should investigate https://developers.google.com/drive/api/v3/file a bit more - it talks about automatically generated full text (extracted from PDF etc) and thumbnails.

simonw commented 2 years ago

> I should investigate https://developers.google.com/drive/api/v3/file a bit more - it talks about automatically generated full text (extracted from PDF etc) and thumbnails.

Annoyingly it looks like indexableText is a write-only property - you can provide your own custom version of it, but there doesn't seem to be a way to extract the automatically derived stuff.

For thumbnails, thumbnailLink may be present and, if so, will be a possibly time-limited link to a thumbnail image - e.g. https://docs.google.com/feeds/vt?gd=true&id=1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg&v=31&s=AMedNnoAAAAAYgy9-L4oJgCv8_vOgmJWqqhNzgz5rCMR&sz=s220 links to this image:

[15K thumbnail image]

But... it only works if I am signed in - it returns a 404 otherwise.

simonw commented 2 years ago

Knock off the sz=s220 from that URL and you get: https://docs.google.com/feeds/vt?gd=true&id=1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg&v=31&s=AMedNnoAAAAAYgy9-L4oJgCv8_vOgmJWqqhNzgz5rCMR

[64K thumbnail image]

simonw commented 2 years ago

I'm releasing this as an alpha because I don't yet have the OAuth credentials verified such that anyone other than me can actually use it!

simonw commented 2 years ago

Released: https://pypi.org/project/google-drive-to-sqlite/