Redesign schema to have `drive_files`, `drive_folders` and `drive_users`

simonw commented 2 years ago

Right now everything goes in one files table.

simonw commented 2 years ago

I think I need three tables:

drive_files - the current files table with a couple of things extracted
drive_users - needed for concepts like ownership
drive_folders for folders

The hardest data modeling challenge here is folders. How should hierarchy be represented? Also, can a file be in more than one folder at once? parents is an array which suggests that it can.

simonw commented 2 years ago

Looks like the multiple parents thing was removed in September 2020 https://cloud.google.com/blog/products/g-suite/simplifying-google-drives-folder-structure-and-sharing-models

simonw commented 2 years ago

Here's what a file looks like:

{
  "kind": "drive#file",
  "id": "1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-",
  "name": "california_warn_raw.csv",
  "mimeType": "text/csv",
  "starred": false,
  "trashed": false,
  "explicitlyTrashed": false,
  "parents": [
    "113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN"
  ],
  "spaces": [
    "drive"
  ],
  "version": "2",
  "webViewLink": "https://drive.google.com/file/d/1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-/view?usp=drivesdk",
  "iconLink": "https://drive-thirdparty.googleusercontent.com/16/type/text/csv",
  "hasThumbnail": false,
  "thumbnailVersion": "0",
  "viewedByMe": true,
  "createdTime": "2022-02-19T04:25:16.517Z",
  "modifiedTime": "2020-11-11T18:10:31.000Z",
  "modifiedByMe": true,
  "owners": [
    {
      "kind": "drive#user",
      "displayName": "Simon Willison",
      "photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
      "me": true,
      "permissionId": "16974643384157631322",
      "emailAddress": "swillison@gmail.com"
    }
  ],
  "lastModifyingUser": {
    "kind": "drive#user",
    "displayName": "Simon Willison",
    "photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
    "me": true,
    "permissionId": "16974643384157631322",
    "emailAddress": "...@gmail.com"
  },
  "shared": false,
  "ownedByMe": true,
  "viewersCanCopyContent": true,
  "copyRequiresWriterPermission": false,
  "writersCanShare": true,
  "quotaBytesUsed": "1070506",
  "isAppAuthorized": false,
  "linkShareMetadata": {
    "securityUpdateEligible": false,
    "securityUpdateEnabled": true
  }
}

Despite parents being a list we know that it should only ever contain a single parent - so we can treat that as a nullable single item.

owners is a list, so it should become a drive_owners many-to-many table against drive_users

lastModifyingUser should be a foreign key against drive_users.

simonw commented 2 years ago

I'm going to turn parents into a single parent that's a foreign key to drive_folders.

Should a folder be stored in both drive_files AND drive_folders? I think not - I think folders should live only in drive_folders.

simonw commented 2 years ago

Of note: currently this command:

google-drive-to-sqlite files --folder '1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j' --json

Does NOT retrieve the details of the folder with ID 1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j - it only gets its children.

It would be better if it also retrieved the folder itself, using:

google-drive-to-sqlite get \
  'https://www.googleapis.com/drive/v3/files/1dbccBzomcvEUGdnoj8-9QG1yHxS0R-_j?fields=*'

(But with the proper list of fields)

simonw commented 2 years ago

Relevant challenge: if a user requests a specific folder and we don't have that folder's parents stored, what do we do about it with regards to foreign keys and the hierarchy?

A few options:

Store the parent ID as the parent but tolerate it referencing a record that doesn't exist
Fetch the parent if it's missing, and keep on going all the way up to the user's drive (which has no parents key)
Store a parent of null
Store a parent of null but store an additional column (maybe parent_id?) which isn't a foreign key but does have the missing reference in it
Store parents as a JSON array, but don't store parent

I'm leaning towards that last option - so every folder would have a parents JSON array which is what the API returned, and then most folders would have a parent foreign key which the tool adds.

Might be a bit confusing though?

simonw commented 2 years ago

Another thing worth considering: supporting efficient "all files and folders within folder X" SQL queries without the user having to run a recursive CTE (since those are hard).

I like the three tree mechanisms implemented in https://django-treebeard.readthedocs.io/ - for this tool I think a materialized path might be best, which could look like this:

_path: 0AK1CICIR8ECDUk9PVA/113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN/1Xdqfeoi8B8YJJR0y-_oQlHYpjHHzD5a-

The _ prefix is to avoid it clashing with path should they add that to the Google Drive API in the future. I should go with _parent too if I invent that field.

One catch with this: if I don't recursively fetch the parents all the way up to the root I'll end up storing incomplete paths, which seems both unexpected and breaks the model where you can efficiently query every file in a folder using where _path like '0AK1CICIR8ECDUk9PVA/113Wb_KLL1dtgx3vpeRfSTOYIUDf3QnnN/%'

So maybe I should recursively fetch parents after all?

simonw commented 2 years ago

Also messy: what should I do if the user runs this:

google-drive-to-sqlite files files.db --full-text dogs

And the search returns hundreds of matches, each contained in a different part of the folder hierarchy?

Should the tool also fetch the full parent list for every single one of those matches, in order to populate the drive_folders table and correctly populate the full _path hierarchy for every file? That could add up to a LOT of extra queries - potentially 3 or 4 extras per file that came back in the search results, depending on how deep and differing the file hierarchies are.

simonw commented 2 years ago

Solution: only populate _path and _parent if it is efficient to do so.

This is true for the --folder case because it's already walking the hierarchy in a way that will cover every one of the options (bar a few additional queries to find the parents of the starting folder).

For other searches I can leave those fields as null - unless the user adds a --populate-folders option in which case we can do all of that extra work because they opted in to it.

simonw commented 2 years ago

I'm actually now leaning towards populating _parent as a foreign key that isn't enforced, to keep things a little bit simpler - at the expense of maybe making queries harder?

I'm going to try that first and see how well it works.

simonw commented 2 years ago

Prototype worked pretty nicely:

diff --git a/google_drive_to_sqlite/cli.py b/google_drive_to_sqlite/cli.py
index da8fc21..cc91298 100644
--- a/google_drive_to_sqlite/cli.py
+++ b/google_drive_to_sqlite/cli.py
@@ -302,8 +302,32 @@ def files(database, auth, folder, q, full_text, json_, nl, stop_after):
     db = sqlite_utils.Database(database)
     # Commit every 100 records
     for chunk in chunks(all, 100):
+        # Add `_parent` columns
+        files = []
+        folders = []
+        for file in chunk:
+            file["_parent"] = file["parents"][0] if file.get("parents") else None
+            if file["mimeType"] == "application/vnd.google-apps.folder":
+                folders.append(file)
+            else:
+                files.append(file)
         with db.conn:
-            db["files"].insert_all(chunk, pk="id", replace=True)
+            if not db["drive_folders"].exists():
+                db["drive_folders"].create({"id": str}, pk="id")
+            db["drive_folders"].insert_all(
+                folders,
+                pk="id",
+                replace=True,
+                alter=True,
+                foreign_keys=(("_parent", "drive_folders", "id"),),
+            )
+            db["drive_files"].insert_all(
+                files,
+                pk="id",
+                replace=True,
+                alter=True,
+                foreign_keys=(("_parent", "drive_folders", "id"),),
+            )

 def load_tokens(auth):

simonw commented 2 years ago

Started experimenting with a recursive CTE but I don't think I've got it quite right yet:

with recursive folders as (
    select id, name, _parent
        from drive_folders
        where _parent is null
    union
        select id, name, _parent || '/' || drive_folders._parent
        from drive_folders
  )
select * from folders

simonw commented 2 years ago

Latest prototype - I think I'm ready to turn this into a tested, documented feature now:

diff --git a/google_drive_to_sqlite/cli.py b/google_drive_to_sqlite/cli.py
index da8fc21..99102af 100644
--- a/google_drive_to_sqlite/cli.py
+++ b/google_drive_to_sqlite/cli.py
@@ -8,7 +8,7 @@ import sqlite_utils
 import sys
 import textwrap
 import urllib.parse
-from .utils import APIClient, files_in_folder_recursive, paginate_files
+from .utils import APIClient, get_file, files_in_folder_recursive, paginate_files

 # https://github.com/simonw/google-drive-to-sqlite/issues/2
 GOOGLE_CLIENT_ID = (
@@ -299,11 +299,58 @@ def files(database, auth, folder, q, full_text, json_, nl, stop_after):
         for line in stream_indented_json(all):
             click.echo(line)
         return
+
+    if folder:
+        # Fetch details of that folder first
+        folder_details = get_file(client, folder, fields=DEFAULT_FIELDS)
+        old_all = all
+
+        def _all():
+            yield folder_details
+            yield from old_all
+
+        all = _all()
+
     db = sqlite_utils.Database(database)
+    save_files_and_folders(db, all)
+
+
+def save_files_and_folders(db, all):
     # Commit every 100 records
     for chunk in chunks(all, 100):
+        # Add `_parent` columns
+        files = []
+        folders = []
+        for file in chunk:
+            file["_parent"] = file["parents"][0] if file.get("parents") else None
+            if file["mimeType"] == "application/vnd.google-apps.folder":
+                folders.append(file)
+            else:
+                files.append(file)
         with db.conn:
-            db["files"].insert_all(chunk, pk="id", replace=True)
+            for table in ("drive_folders", "drive_files"):
+                if not db[table].exists():
+                    db[table].create(
+                        {"id": str, "_parent": str},
+                        pk="id",
+                    )
+                    # Gotta add foreign key after table is created, to avoid
+                    # AlterError: No such column: drive_folders.id
+                    db[table].add_foreign_key("_parent", "drive_folders", "id")
+            db["drive_folders"].insert_all(
+                folders,
+                pk="id",
+                replace=True,
+                alter=True,
+                foreign_keys=(("_parent", "drive_folders", "id"),),
+            )
+            db["drive_files"].insert_all(
+                files,
+                pk="id",
+                replace=True,
+                alter=True,
+                foreign_keys=(("_parent", "drive_folders", "id"),),
+            )

 def load_tokens(auth):
diff --git a/google_drive_to_sqlite/utils.py b/google_drive_to_sqlite/utils.py
index 2d7fbd0..ffbb997 100644
--- a/google_drive_to_sqlite/utils.py
+++ b/google_drive_to_sqlite/utils.py
@@ -7,6 +7,17 @@ class FilesError(Exception):
     pass

+def get_file(client, file_id, fields=None):
+    file_url = "https://www.googleapis.com/drive/v3/files/{}".format(file_id)
+    params = {}
+    if fields is not None:
+        params["fields"] = ",".join(fields)
+    return client.get(
+        file_url,
+        params=params,
+    ).json()
+
+
 def paginate_files(client, *, corpora=None, q=None, fields=None):
     pageToken = None
     files_url = "https://www.googleapis.com/drive/v3/files"

simonw commented 2 years ago

OK, I'm happy with how files and folders work now. Next step: extract users out into drive_users and set up the foreign key for lastModifyingUser and a many-to-many drive_owners table for owners.

simonw commented 2 years ago

Urgh, this is annoying. Consider this lastModifyingUser:

    "lastModifyingUser": {
      "kind": "drive#user",
      "displayName": "Simon Willison",
      "photoLink": "https://lh3.googleusercontent.com/a-/AOh14Gg9Loyxove5ocfBp0mg0u2afcTpM1no8QJnwbWnxw=s64",
      "me": true,
      "permissionId": "16974643384157631322",
      "emailAddress": "...@gmail.com"
    }

It doesn't have a user ID! Which is frustrating because I want to populate drive_users with uniquely identified records.

That permissionId does appear to be consistently the same ID for each user no matter how many times they appear though. I'm going to treat that as the user ID for the moment and see if it works.

simonw commented 2 years ago

Ended up adding these two tables:

CREATE TABLE [drive_folders_owners] (
   [item_id] TEXT REFERENCES [drive_folders]([id]),
   [user_id] TEXT REFERENCES [drive_users]([permissionId]),
   PRIMARY KEY ([item_id],
   [user_id])
);
CREATE TABLE [drive_files_owners] (
   [item_id] TEXT REFERENCES [drive_files]([id]),
   [user_id] TEXT REFERENCES [drive_users]([permissionId]),
   PRIMARY KEY ([item_id],
   [user_id])
);

Final schema is in the docs here: https://github.com/simonw/google-drive-to-sqlite/blob/319236562ec5c1564f507e65647a9863f20bc3a6/README.md#database-schema

simonw / google-drive-to-sqlite

Redesign schema to have `drive_files`, `drive_folders` and `drive_users` #9