Closed by simonw 2 years ago
Authentication will be tricky. For the moment I'll go with the simplest thing possible, but I may need to build a `google-drive-to-sqlite auth` command just to get things up and running.

I prototyped this in a notebook - here are the important bits:
```python
google_client_id = '99025868001-1rdn5dj68t4m60ommr5kb151p038fn6v.apps.googleusercontent.com'
# Is it safe to bake the secret into a distributed CLI app?
google_client_secret = '...'
scope = "https://www.googleapis.com/auth/drive.readonly"
auth_url = "https://accounts.google.com/o/oauth2/v2/auth?access_type=offline&client_id={}&redirect_uri=urn:ietf:wg:oauth:2.0:oob&response_type={}&scope={}".format(
    google_client_id, "code", scope
)
print(auth_url)
```
The user then clicks on that URL, grants permissions and copies a code from the resulting page.
```python
import httpx

copied_code = '4/1A...'
response = httpx.post("https://www.googleapis.com/oauth2/v4/token", data={
    "code": copied_code,
    "client_id": google_client_id,
    "client_secret": google_client_secret,
    "redirect_uri": "urn:ietf:wg:oauth:2.0:oob",
    "grant_type": "authorization_code",
})
tokens = response.json()
# tokens now has an "access_token" key that can be used directly, and a "refresh_token"
# that can be exchanged for another access token later on
```
The trickiest thing to figure out was how to grab all of the files in a particular folder. Turns out you need the folder ID, then you can run a search with a `q` of `"folder_id" in parents`. You then have to do that recursively for every child folder - there's no equivalent of `find .` in the API.

It was also hard to figure out the `fields=` parameter for returning specific fields. The syntax for that turned out to be `nextPageToken, files(id,name,kind,...)`:
```python
fields = [
    "kind", "id", "name", "mimeType", "starred", "trashed", "explicitlyTrashed",
    "parents", "spaces", "version", "webViewLink", "iconLink", "hasThumbnail",
    "thumbnailVersion", "viewedByMe", "createdTime", "modifiedTime", "modifiedByMe",
    "owners", "lastModifyingUser", "shared", "ownedByMe", "viewersCanCopyContent",
    "copyRequiresWriterPermission", "writersCanShare", "folderColorRgb",
    "quotaBytesUsed", "isAppAuthorized", "linkShareMetadata"
]

def files_in_folder(folder_id):
    pageToken = None
    files_url = "https://www.googleapis.com/drive/v3/files"
    params = {
        "corpora": "user",
        "q": '"{}" in parents'.format(folder_id),
        "fields": "nextPageToken, files({})".format(",".join(fields)),
    }
    while True:
        if pageToken is not None:
            params["pageToken"] = pageToken
        else:
            params.pop("pageToken", None)
        data = httpx.get(files_url, params=params, headers={
            "Authorization": "Bearer {}".format(access_token)
        }).json()
        print(params)
        if "error" in data:
            raise Exception(data)
        yield from data["files"]
        pageToken = data.get("nextPageToken", None)
        if pageToken is None:
            break
```
```python
import sqlite_utils

def fetch_all_children(table, folder_id, indent=""):
    print(indent, folder_id)
    # files_in_folder() is a generator, so collect it into a list first
    files = list(files_in_folder(folder_id))
    table.insert_all(files, pk="id", replace=True)
    print(indent, " inserted {} files".format(len(files)))
    folders = [f["id"] for f in files if f["mimeType"] == "application/vnd.google-apps.folder"]
    for folder in folders:
        fetch_all_children(table, folder, indent=indent + "  ")

folder_id = "1PhZt8DOcPjIpV09agCEngN-xzViOhrJu"
db = sqlite_utils.Database("/tmp/files.db")
fetch_all_children(db["all_files"], folder_id)
```
The access tokens only last for 3600 seconds (1 hour) - so I found myself having to obtain a second access token using the refresh token like this:
```python
access_token = httpx.post("https://www.googleapis.com/oauth2/v4/token", data={
    "grant_type": "refresh_token",
    "refresh_token": '1//06...',
    "client_id": google_client_id,
    "client_secret": google_client_secret,
}).json()["access_token"]
```
Here's the recipe that worked for retrieving the binary contents of a file - the trick is the `alt=media` parameter:
```python
def get_binary(file_id):
    return httpx.get(
        "https://www.googleapis.com/drive/v3/files/{}?alt=media".format(file_id),
        headers={
            "Authorization": "Bearer {}".format(access_token)
        }
    ).content
```
I think `thumbnailLink` should be in that list of fields too. I wish I had a definitive list of fields!
I'm going to have this:

`google-drive-to-sqlite files google.db`

Retrieve ALL files in the drive.

`google-drive-to-sqlite files google.db --folder FOLDER_ID`

Will do just the files in that folder - using `?q=` and `"folder_id" in parents` - but applied recursively to all of the sub-folders.

`google-drive-to-sqlite files google.db --q search_term`

Will allow advanced search terms (passed to `?q=`).

Here's the docs for `?q=`: https://developers.google.com/drive/api/v3/search-files
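For reference, a few example values for `q` - these are illustrations drawn from that search-files documentation page, not from this project's code, and `FOLDER_ID` is a placeholder:

```python
# Example Drive search expressions for the q= parameter
folder_id = "FOLDER_ID"  # placeholder, not a real folder ID

q_children = '"{}" in parents'.format(folder_id)  # files directly inside a folder
q_pdfs = "mimeType = 'application/pdf'"           # only PDF files
q_named = "name contains 'report'"                # substring match on the name
```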
Might add other options that directly reflect the API: https://developers.google.com/drive/api/v3/reference/files/list
I should investigate https://developers.google.com/drive/api/v3/file a bit more - it talks about automatically generated full text (extracted from PDF etc) and thumbnails.
Annoyingly it looks like `indexableText` is a write-only property - you can provide your own custom version of it, but there doesn't seem to be a way to extract the automatically derived stuff.
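A sketch of what writing it would look like - assuming the documented `files.update` endpoint with the text nested under `contentHints`; the helper name is mine:

```python
# Hypothetical helper: build the URL and JSON body for setting
# contentHints.indexableText via a PATCH to files.update
API_BASE = "https://www.googleapis.com/drive/v3/files"

def indexable_text_request(file_id, text):
    url = "{}/{}".format(API_BASE, file_id)
    body = {"contentHints": {"indexableText": text}}
    return url, body

# You would then send it with httpx.patch(url, json=body, headers=...)
# using a bearer token, as in the earlier snippets.
```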
For thumbnails, `thumbnailLink` may be present and, if so, will be a possibly time-limited link to a thumbnail image - e.g. https://docs.google.com/feeds/vt?gd=true&id=1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg&v=31&s=AMedNnoAAAAAYgy9-L4oJgCv8_vOgmJWqqhNzgz5rCMR&sz=s220 links to a 15K image.

But... it only works if I am signed in - it returns a 404 otherwise.
Knock off the `sz=s220` from that URL and you get https://docs.google.com/feeds/vt?gd=true&id=1YEsITp_X8PtDUJWHGM0osT-TXAU1nr0e7RSWRM2Jpyg&v=31&s=AMedNnoAAAAAYgy9-L4oJgCv8_vOgmJWqqhNzgz5rCMR - a 64K image.
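Stripping that parameter can be done mechanically - a small sketch (the sz-free URL returning a larger image is observed behaviour here, not documented):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_sz(url):
    # Remove the sz= query parameter to request the larger rendering
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "sz"]
    return urlunparse(parts._replace(query=urlencode(query)))
```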
I'm releasing this as an alpha because I don't yet have the OAuth credentials verified such that anyone other than me can actually use it!
Basic design:

Looks for `auth.json` with credentials in the current directory - it wants a `refresh_token` or an `access_token` of some sort.

Features for the initial release:
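The `auth.json` lookup described under "Basic design" could be sketched like this - the file shape and function name are assumptions, not the final design:

```python
import json

def load_credentials(path="auth.json"):
    # Read credentials from the current directory; either a refresh_token
    # or an access_token is acceptable
    with open(path) as fp:
        creds = json.load(fp)
    if "refresh_token" not in creds and "access_token" not in creds:
        raise ValueError("auth.json needs a refresh_token or an access_token")
    return creds
```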