simonw / google-drive-to-sqlite

Create a SQLite database containing metadata from Google Drive
https://datasette.io/tools/google-drive-to-sqlite
Apache License 2.0
152 stars 13 forks source link

`google-drive-to-sqlite download FILE_ID` command #4

Closed simonw closed 2 years ago

simonw commented 2 years ago

Here's the recipe that worked for retrieving the binary contents of a file - the trick is the alt=media parameter:

def get_binary(file_id):
    return httpx.get(
        "https://www.googleapis.com/drive/v3/files/{}?alt=media".format(file_id),
        headers={
            "Authorization": "Bearer {}".format(access_token)
        }
    ).content

Originally posted by @simonw in https://github.com/simonw/google-drive-to-sqlite/issues/1#issuecomment-1041026849

simonw commented 2 years ago

May need to do two API requests here - one to figure out the filename, extension and content type, the other to retrieve the contents.

simonw commented 2 years ago

I'll use an httpx streaming download: https://www.python-httpx.org/quickstart/#streaming-responses

>>> with httpx.stream("GET", "https://www.example.com") as r:
...     for data in r.iter_bytes():
...         print(data)
simonw commented 2 years ago

If you provide a -o option I'll use that as the filename without running a separate request.

Here's the API call to get the filename:

% google-drive-to-sqlite get 'https://www.googleapis.com/drive/v3/files/0B32uDVNZfiEKOUliZ3NMNC0yV3RiYkZabjhlSHhXZ2UxOEZn'
{
    "kind": "drive#file",
    "id": "0B32uDVNZfiEKOUliZ3NMNC0yV3RiYkZabjhlSHhXZ2UxOEZn",
    "name": "Items recovered #3.jpg",
    "mimeType": "image/jpeg",
    "resourceKey": "0-ZZIGVxj4zbk5mf3HltMvsg"
}

The problem with using the filename from that API is that the user will run the command and then have no idea where the file was saved.

wget simonwillison.net solves that by showing you in the output to stderr:

...
2022-02-16 19:34:57 (44.7 MB/s) - ‘index.html’ saved [93300]

Run that a second time:

2022-02-16 19:35:30 (3.25 MB/s) - ‘index.html.1’ saved [93300]

Which seems right - it's definitely bad to over-write an existing file with the same name.

simonw commented 2 years ago

I'm going to support multiple file IDs.

If you use the -o option you can only use a single ID.

-o - will write to standard output.

simonw commented 2 years ago

Design for this feature then:

# Look up FILE_ID's name, download it to that filename on disk
google-drive-to-sqlite download FILE_ID
# If that filename already exists, adds .1 or .2 etc to the end
# This shows information to stderr, unless you pass `-s` or `--silent`

# Control the output filename (and save one API request)
google-drive-to-sqlite download FILE_ID -o out.jpg
# This will over-write out.jpg if it already exists

# Download multiple files
google-drive-to-sqlite download FILE1 FILE2 FILE3
# This saves them with their filenames fetched by the other API request

Maybe it would be good to have a --use-file-id-as-filename option (but something a bit snappier than that)? I'd still need to do an API call to figure out the file extension to use... or that option could skip file extensions entirely.

simonw commented 2 years ago

Excellent news: the HTTP headers that come back with the ?alt=media response include the content-type, which I can use to derive the correct file extension!

{
    "x-guploader-uploadid": "ADPycdshfScGL5dQyKL_NvivgOFaC4gGfekqs5Gq0sAIVZQQUdU0hLheFGAUjFMs6Bf-P7GMLZxIVVZnSVWPVg8J6ic",
    "content-type": "image/jpeg",
    "content-disposition": "attachment",
    "vary": "Origin, X-Origin",
    "expires": "Thu, 17 Feb 2022 03:58:21 GMT",
    "date": "Thu, 17 Feb 2022 03:58:21 GMT",
    "cache-control": "private, max-age=0, must-revalidate",
    "content-length": "45237",
    "x-goog-hash": "crc32c=Lpascw==",
    "server": "UploadServer",
    "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000,h3-Q050=\":443\"; ma=2592000,h3-Q046=\":443\"; ma=2592000,h3-Q043=\":443\"; ma=2592000,quic=\":443\"; ma=2592000; v=\"46,43\""
}
simonw commented 2 years ago

For a PDF it's: content-type": "application/pdf"

I'm going to split on / and use the last token as the file extension.

simonw commented 2 years ago

Since it returns content-length too I could even have a progress bar.

simonw commented 2 years ago

Rough prototype with working progress bar (for files that return content-length - a video did, two PDFs did not):

@cli.command()
@click.argument("file_id")
@click.option(
    "-a",
    "--auth",
    type=click.Path(file_okay=True, dir_okay=False, allow_dash=True),
    default="auth.json",
    help="Path to auth.json token file",
)
def download(file_id, auth):
    access_token = load_token(auth)
    with httpx.stream(
        "GET",
        "https://www.googleapis.com/drive/v3/files/{}?alt=media".format(file_id),
        headers={"Authorization": "Bearer {}".format(access_token)},
    ) as r:
        click.echo(json.dumps(dict(r.headers), indent=4), err=True)
        click.echo("Writing to FILE.bin")
        fp = open("FILE.bin", "wb")
        length = int(r.headers.get("content-length", "0"))
        if length:
            with click.progressbar(
                length=int(r.headers["content-length"]), label="FILENAME"
            ) as bar:
                for data in r.iter_bytes():
                    fp.write(data)
                    bar.update(len(data))
        else:
            for data in r.iter_bytes():
                fp.write(data)
simonw commented 2 years ago

Actually I'm not going to fetch the filename at all unless the user asks me to. The first version of this will use OAENTUHAOTNUHAN.gif - file ID plus extension.

simonw commented 2 years ago

Documentation: https://github.com/simonw/google-drive-to-sqlite/blob/7f09c817546071413d4c2a9754d323b0f88e45e6/README.md#google-drive-to-sqlite-download-file_id