Closed simonw closed 2 years ago
I think I need three tables:
drive_files
- the current files
table with a couple of things extracteddrive_users
- needed for concepts like ownershipdrive_folders
for foldersThe hardest data modeling challenge here is folders. How should hierarchy be represented? Also, can a file be in more than one folder at once? parents
is an array which suggests that it can.
Looks like the multiple parents thing was removed in September 2020 https://cloud.google.com/blog/products/g-suite/simplifying-google-drives-folder-structure-and-sharing-models
Here's what a file looks like:
{
"kind": "drive#file",
"id": "1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-",
"name": "california_warn_raw.csv",
"mimeType": "text/csv",
"starred": false,
"trashed": false,
"explicitlyTrashed": false,
"parents": [
"113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN"
],
"spaces": [
"drive"
],
"version": "2",
"webViewLink": "https://drive.google.com/file/d/1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-/view?usp=drivesdk",
"iconLink": "https://drive-thirdparty.googleusercontent.com/16/type/text/csv",
"hasThumbnail": false,
"thumbnailVersion": "0",
"viewedByMe": true,
"createdTime": "2022-02-19T04:25:16.517Z",
"modifiedTime": "2020-11-11T18:10:31.000Z",
"modifiedByMe": true,
"owners": [
{
"kind": "drive#user",
"displayName": "Simon Willison",
"photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
"me": true,
"permissionId": "16974643384157631322",
"emailAddress": "swillison@gmail.com"
}
],
"lastModifyingUser": {
"kind": "drive#user",
"displayName": "Simon Willison",
"photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
"me": true,
"permissionId": "16974643384157631322",
"emailAddress": "...@gmail.com"
},
"shared": false,
"ownedByMe": true,
"viewersCanCopyContent": true,
"copyRequiresWriterPermission": false,
"writersCanShare": true,
"quotaBytesUsed": "1070506",
"isAppAuthorized": false,
"linkShareMetadata": {
"securityUpdateEligible": false,
"securityUpdateEnabled": true
}
}
Despite parents
being a list we know that it should only ever contain a single parent - so we can treat that as a nullable single item.
owners
is a list, so it should become a drive_owners
many-to-many table against drive_users
lastModifyingUser
should be a foreign key against drive_users
.
I'm going to turn parents
into a single parent
that's a foreign key to drive_folders
.
Should a folder be stored in both drive_files
AND drive_folders
? I think not - I think folders should live only in drive_folders
.
Of note: currently this command:
google-drive-to-sqlite files --folder '1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j' --json
Does NOT retrieve the details of the folder with ID 1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j
- it only gets its children.
It would be better if it also retrieved the folder itself, using:
google-drive-to-sqlite get \
'https://www.googleapis.com/drive/v3/files/1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j?fields=*'
(But with the proper list of fields)
Relevant challenge: if a user requests a specific folder and we don't have that folder's parents stored, what do we do about it with regards to foreign keys and the hierarchy?
A few options:
parent
but tolerate it referencing a record that doesn't existparents
key)parent
of nullparent
of null but store an additional column (maybe parent_id
?) which isn't a foreign key but does have the missing reference in itparents
as a JSON array, but don't store parent
I'm leaning towards that last option - so every folder would have a parents
JSON array which is what the API returned, and then most folders would have a parent
foreign key which the tool adds.
Might be a bit confusing though?
Another thing worth considering: supporting efficient "all files and folders within folder X" SQL queries without the user having to run a recursive CTE (since those are hard).
I like the three tree mechanisms implemented in https://django-treebeard.readthedocs.io/ - for this tool I think a materialized path might be best, which could look like this:
_path
: 0AK1CICIR8ECDUk9PVA/113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN/1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-
The _
prefix is to avoid it clashing with path
should they add that to the Google Drive API in the future. I should go with _parent
too if I invent that field.
One catch with this: if I don't recursively fetch the parents all the way up to the root I'll end up storing incomplete paths, which seems both unexpected and breaks the model where you can efficiently query every file in a folder using where _path like '0AK1CICIR8ECDUk9PVA/113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN/%'
So maybe I should recursively fetch parents after all?
Also messy: what should I do if the user runs this:
google-drive-to-sqlite files files.db --full-text dogs
And the search returns hundreds of matches, each contained in a different part of the folder hierarchy?
Should the tool also fetch the full parent list for every single one of those matches, in order to populate the drive_folders
table and correctly populate the full _path
hierarchy for every file? That could add up to a LOT of extra queries - potentially 3 or 4 extras per file that came back in the search results, depending on how deep and differing the file hierarchies are.
Solution: only populate _path
and _parent
if it is efficient to do so.
This is true for the --folder
case because it's already walking the hierarchy in a way that will cover every one of the options (bar a few additional queries to find the parents of the starting folder).
For other searches I can leave those fields as null
- unless the user adds a --populate-folders
option in which case we can do all of that extra work because they opted in to it.
I'm actually now leaning towards populating _parent
as a foreign key that isn't enforced, to keep things a little bit simpler - at the expense of maybe making queries harder?
I'm going to try that first and see how well it works.
Prototype worked pretty nicely:
diff --git a/google_drive_to_sqlite/cli.py b/google_drive_to_sqlite/cli.py
index da8fc21..cc91298 100644
--- a/google_drive_to_sqlite/cli.py
+++ b/google_drive_to_sqlite/cli.py
@@ -302,8 +302,32 @@ def files(database, auth, folder, q, full_text, json_, nl, stop_after):
db = sqlite_utils.Database(database)
# Commit every 100 records
for chunk in chunks(all, 100):
+ # Add `_parent` columns
+ files = []
+ folders = []
+ for file in chunk:
+ file["_parent"] = file["parents"][0] if file.get("parents") else None
+ if file["mimeType"] == "application/vnd.google-apps.folder":
+ folders.append(file)
+ else:
+ files.append(file)
with db.conn:
- db["files"].insert_all(chunk, pk="id", replace=True)
+ if not db["drive_folders"].exists():
+ db["drive_folders"].create({"id": str}, pk="id")
+ db["drive_folders"].insert_all(
+ folders,
+ pk="id",
+ replace=True,
+ alter=True,
+ foreign_keys=(("_parent", "drive_folders", "id"),),
+ )
+ db["drive_files"].insert_all(
+ files,
+ pk="id",
+ replace=True,
+ alter=True,
+ foreign_keys=(("_parent", "drive_folders", "id"),),
+ )
def load_tokens(auth):
Started experimenting with a recursive CTE but I don't think I've got it quite right yet:
with recursive folders as (
select id, name, _parent
from drive_folders
where _parent is null
union
select id, name, _parent || '/' || drive_folders._parent
from drive_folders
)
select * from folders
Latest prototype - I think I'm ready to turn this into a tested, documented feature now:
diff --git a/google_drive_to_sqlite/cli.py b/google_drive_to_sqlite/cli.py
index da8fc21..99102af 100644
--- a/google_drive_to_sqlite/cli.py
+++ b/google_drive_to_sqlite/cli.py
@@ -8,7 +8,7 @@ import sqlite_utils
import sys
import textwrap
import urllib.parse
-from .utils import APIClient, files_in_folder_recursive, paginate_files
+from .utils import APIClient, get_file, files_in_folder_recursive, paginate_files
# https://github.com/simonw/google-drive-to-sqlite/issues/2
GOOGLE_CLIENT_ID = (
@@ -299,11 +299,58 @@ def files(database, auth, folder, q, full_text, json_, nl, stop_after):
for line in stream_indented_json(all):
click.echo(line)
return
+
+ if folder:
+ # Fetch details of that folder first
+ folder_details = get_file(client, folder, fields=DEFAULT_FIELDS)
+ old_all = all
+
+ def _all():
+ yield folder_details
+ yield from old_all
+
+ all = _all()
+
db = sqlite_utils.Database(database)
+ save_files_and_folders(db, all)
+
+
+def save_files_and_folders(db, all):
# Commit every 100 records
for chunk in chunks(all, 100):
+ # Add `_parent` columns
+ files = []
+ folders = []
+ for file in chunk:
+ file["_parent"] = file["parents"][0] if file.get("parents") else None
+ if file["mimeType"] == "application/vnd.google-apps.folder":
+ folders.append(file)
+ else:
+ files.append(file)
with db.conn:
- db["files"].insert_all(chunk, pk="id", replace=True)
+ for table in ("drive_folders", "drive_files"):
+ if not db[table].exists():
+ db[table].create(
+ {"id": str, "_parent": str},
+ pk="id",
+ )
+ # Gotta add foreign key after table is created, to avoid
+ # AlterError: No such column: drive_folders.id
+ db[table].add_foreign_key("_parent", "drive_folders", "id")
+ db["drive_folders"].insert_all(
+ folders,
+ pk="id",
+ replace=True,
+ alter=True,
+ foreign_keys=(("_parent", "drive_folders", "id"),),
+ )
+ db["drive_files"].insert_all(
+ files,
+ pk="id",
+ replace=True,
+ alter=True,
+ foreign_keys=(("_parent", "drive_folders", "id"),),
+ )
def load_tokens(auth):
diff --git a/google_drive_to_sqlite/utils.py b/google_drive_to_sqlite/utils.py
index 2d7fbd0..ffbb997 100644
--- a/google_drive_to_sqlite/utils.py
+++ b/google_drive_to_sqlite/utils.py
@@ -7,6 +7,17 @@ class FilesError(Exception):
pass
+def get_file(client, file_id, fields=None):
+ file_url = "https://www.googleapis.com/drive/v3/files/{}".format(file_id)
+ params = {}
+ if fields is not None:
+ params["fields"] = ",".join(fields)
+ return client.get(
+ file_url,
+ params=params,
+ ).json()
+
+
def paginate_files(client, *, corpora=None, q=None, fields=None):
pageToken = None
files_url = "https://www.googleapis.com/drive/v3/files"
OK, I'm happy with how files and folders work now. Next step: extract users out into drive_users
and set up the foreign key for lastModifyingUser
and a many-to-many drive_owners
table for owners
.
Urgh, this is annoying. Consider this lastModifyingUser
:
"lastModifyingUser": {
"kind": "drive#user",
"displayName": "Simon Willison",
"photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
"me": true,
"permissionId": "16974643384157631322",
"emailAddress": "...@gmail.com"
}
It doesn't have a user ID! Which is frustrating because I want to populate drive_users
with uniquely identified records.
That permissionId
does appear to be consistently the same ID for each user no matter how many times they appear though. I'm going to treat that as the user ID for the moment and see if it works.
Ended up adding these two tables:
CREATE TABLE [drive_folders_owners] (
[item_id] TEXT REFERENCES [drive_folders]([id]),
[user_id] TEXT REFERENCES [drive_users]([permissionId]),
PRIMARY KEY ([item_id],
[user_id])
);
CREATE TABLE [drive_files_owners] (
[item_id] TEXT REFERENCES [drive_files]([id]),
[user_id] TEXT REFERENCES [drive_users]([permissionId]),
PRIMARY KEY ([item_id],
[user_id])
);
Final schema is in the docs here: https://github.com/simonw/google-drive-to-sqlite/blob/319236562ec5c1564f507e65647a9863f20bc3a6/README.md#database-schema
Right now everything goes in one
files
table.