simonw / airtable-export

Export Airtable data to YAML, JSON or SQLite files on disk
https://datasette.io/tools/airtable-export
Apache License 2.0
112 stars 15 forks source link

Option to export and save images and attachments #27

Open simonw opened 11 months ago

simonw commented 11 months ago

https://support.airtable.com/docs/airtable-attachment-url-behavior

The attachment viewer URL requires a viewer to be logged in via Airtable, whereas the download URL can be publicly accessed via a web browser. This presents a potential security risk for any files that contain sensitive information since a permanently accessible download URL would still be viewable even after a viewer has been removed from an Airtable workspace. For that reason, we introduced expiring attachment download URLs starting November 8th, 2022.

[...]

API - Similar to the previously mentioned CSV downloads, download any attachments from the links attachment obtained via Airtable’s API before those links expire.

This tool should grow a feature to download and save attachments, since the URLs are no longer a useful long-term reference.

How long until they expire? The docs say:

If you, or a third-party integration, are using API processes to obtain attachment URLs from Airtable, then it’s important to note that the URLs obtained will only be active for a short period of time (~2 hours). Using Airtable as a CDN has been and continues to be discouraged in line with this change.

simonw commented 11 months ago

The JSON for a PDF attachment looks like this:

[
    {
        "id": "attHd9sanAvr8PwGk",
        "url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/eql_yHjtzp6IHoic_KROvw/FjJhJto96zC-1yxspRtNNa_WjWcU8_L2xJDUJd65Hj530F56TDTkxrSUjaUfLfSibHmxEwOkf3rcsoANvGaO0pl2YVUb00YFOlSLhEq_5gA/zsiKz8BHmXq87lkH3o4j2AgV0NTuUsZIcyT5JduLMo0",
        "filename": "ardata.pdf",
        "size": 506898,
        "type": "application/pdf",
        "thumbnails": {
            "small": {
                "url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/BScJEWFsxF5MOcZJ_i8RsA/tPGX0GFgMelnlR88d41c6DJDl9pa9E3M5y3123HrDPrj_A4Kocl2StxwBLsb3dFSZrwIjcw5csCI0DQZ5d_5vA/zI1iLniIAYc-oZhHonurjZ7h2pu5vxlzbR1NE2DMTx0",
                "width": 28,
                "height": 36
            },
            "large": {
                "url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/QMvnNnH-ftXXOeBRYJk4fQ/io3cJpspKnpka1q-fAhsEiJCwPmReOow7NeGPl96ogH5HFXtqxQvge85hcz3EIzPJ7rr0-uzdAUoDWfPi4a0mw/mpJk4qmSys6QE8IqSp4whgrc-52UiOR1DD9X7rIiTh8",
                "width": 396,
                "height": 512
            }
        }
    }
]

For an image, this:

[
    {
        "id": "attwPzAABrX7mUpdl",
        "width": 4032,
        "height": 3024,
        "url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/-Ex6TxOvcFa-QRIDtsSZwg/kZ2Id34JwYwhYwqWUbhlqMLntfqEbe8J95Gnb1eXswOMlCh8lvxIujNMOojCSCJ8l5ZMO1UKsPPNZSLusJHcja8tkPPBnrBtvEpfof_ti6c/2UH1gYdWPHDAKGY1qe97mUSxskh2Q2uPKh_79MRdXmI",
        "filename": "2023-05-24 18.54.28.jpg",
        "size": 5237935,
        "type": "image/jpeg",
        "thumbnails": {
            "small": {
                "url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/4vruo5I_9EIlxlN7KPLozg/TFof3GLJo3MMxVLeEYqYAL7-g8RHHpD0mn3KzJB-ffPvKV9KXg8wVk_2onpLARuugA87bCEwLdhqMthHFoFr4g/6dCCAksO2neaFKqtEu5Tws0-CRuXrm8OP21XlHWYTFM",
                "width": 48,
                "height": 36
            },
            "large": {
                "url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/Sjt3pZJeeovYOiVtWEBuMQ/VJx0IHFStQ1CEvOKsaoeMixItvcw3gVQ-MxYDt60q98S9iAOiSxFmw47tUdd-ozsnPJ-bjOAJ8NxOb2eLcLdyQ/EKsuXWiXUzmiFedTTUmnx46s9JFnZXkq2UXcENSPyT8",
                "width": 683,
                "height": 512
            },
            "full": {
                "url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/uQVk3ZjVPk42MXgkdZ3Cjw/vFhGv1qIgRKSfC8Dah5TRgUqNSFxGW0UUc82KhUFjIt62kyT8Qr173dK91Y7R_keQEohFm25jgdWODR4LvVKMg/4yrXgtkH8nwhC4EsQ0oaQxrseOIQowDsRtOt0HbIRFA",
                "width": 3000,
                "height": 3000
            }
        }
    }
]
simonw commented 11 months ago

What filename should I use?

Two options: there's the id field - which is guaranteed to be unique, but isn't very nice to read. And there's the filename field which might be a duplicate.

Maybe the answer is to store them like this:

attwPzAABrX7mUpdl/2023-05-24 18.54.28.jpg

Or like this:

attwPzAABrX7mUpdl-2023-05-24 18.54.28.jpg

I'm leaning towards the latter - the airtable ID, then a hyphen, then the filename.

simonw commented 11 months ago

Are attachments unique to their tables or bases? Probably, but I can't be 100% sure.

simonw commented 11 months ago

Maybe the CLI grows a --attachments folder/ option which saves all attachments to the specified folder.

Or it could be a flag, so you just pass --attachments and it automatically creates output-folder/attachments/ and saves them in there.

I think I like the latter option better. I'll build that first.

simonw commented 11 months ago

I'm going to download the attachments as a second step, with a progress bar based on the total count of bytes to fetch.

Should I fetch the thumbnails too? I think so - even though that will slightly throw off the progress bar since I won't have the file size of those (it's not in the JSON).

I'll store those as attwPzAABrX7mUpdl-full-2023-05-24 18.54.28.jpg to avoid anything where filenames confuse things. I guess .jpg is the right extension for all thumbnails?

I'll use $airtable_id-original-%filename for the original to namespace it separately from the thumbnails.

simonw commented 11 months ago

Huh, no at least one of the PDF thumbnails turned out to be a PNG. I guess I should set the file extension based on the detected content type.

simonw commented 11 months ago

One slightly tricky thing is that the YAML/JSON flat files may end up with URLs to attachments which don't directly correspond to the files on disk.

Not sure the best way to fix that.

The stored YAML looks like this:

- Attachments:
  - filename: 2022-07-26 11.13.58.jpg
    height: 3024
    id: atttqayAwklSIehzC
    size: 5471198
    thumbnails:
      full:
        height: 3000
        url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/LhLoyhSixZRtUMOIW4rDmQ/diboUo14Z7KGQWOvyTPvG11qQ4FtV-xtDLDMXdETthmjqdM1Aysp-Gwa1f9lvR8ojVSURy1pw1VaL9SEDLQWxQ/_30sml0SbUmdNgkW9moBIQX9xSBVGMG6iLbCvSwIhgE
        width: 3000
      large:
        height: 512
        url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/_afO19TInFwIhO4-QVV-FQ/S-JgJs8MHCKUbV8scsZE4tZKwTrCl1yxq5KCFe0O97yDnSO-4aKqPKvDrv2NX0JRjeD4ecwhgnJJepTyF3BHQg/_G1mqoQr68W8-3QYo-9f2X2eiDzcDBumSDV_7DlerEM
        width: 683
      small:
        height: 36
        url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/17No0cBf3trPUv_1wywegw/360f3uc0uxEj5BCu0pqb4l7RrR6LgYy4RDLImLyhCxgbFQ6g2Ztss-yU1ouxU5BqTpqFP5tI0OWM6PhtnZdlFw/XUp16QgEVA0RvjU94Lm3HZYre_l0a83WxzPAmead57o
        width: 48
    type: image/jpeg
    url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/uZehbrwpqTtVT3f01w7cyQ/z6MGnx2f-cmr8Mrkwnj9m74ImZUwBUDOkYLPf9aAWhAhpEa6JLh7ydwJU_aDvU7B0nyi0_kfP1NERTq0IQz3eOYgrjCJh-z593y8WAzLawo/WuDrxvXaLSmGtf44yeduCy3pbsuwc-YTsvKGEsXpaaE
    width: 4032

I could rewrite this to insert _saved: filename next to each of those url fields.

simonw commented 11 months ago

So I think the full process is:

  1. Export and store the data as YAML/SQLite/etc
  2. Gather together all of those URLs and go and fetch archived copy of them (maybe skipping ones that have been downloaded previously)
  3. Update the stored YAML/SQLite to add the _saved keys pointing to those filenames.
simonw commented 11 months ago

I don't particularly want to have to round-trip the YAML/JSON formats, so I think I'll keep all that data in memory, snapshot it to disk, then do the attachment downloads and snapshot each file to disk again with the extra fields after the attachments have been saved.

simonw commented 11 months ago

Better option: do this on a per-table basis - so it loads a table into memory, snaps it to disk, saves any attachments, updates the snapshot on disk and then moves on to the next table.

I still like the early snapshot because I like that the user won't lose data if they cancel the operation during a large attachment download.

marks commented 1 week ago

👋 Hi @simonw - thank you for this great repo.

I was wondering if you'd be open to a PR for downloading attachments. I have a pretty simple MVP in this commit that works by downloading files after downloading all records. WDYT?

It's slightly hacky in that it does not rely on --schema also being called and knowing exactly which fields are truly attachment fields, but this keeps the code quite a bit simpler.