Open simonw opened 11 months ago
The JSON for a PDF attachment looks like this:
[
{
"id": "attHd9sanAvr8PwGk",
"url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/eql_yHjtzp6IHoic_KROvw/FjJhJto96zC-1yxspRtNNa_WjWcU8_L2xJDUJd65Hj530F56TDTkxrSUjaUfLfSibHmxEwOkf3rcsoANvGaO0pl2YVUb00YFOlSLhEq_5gA/zsiKz8BHmXq87lkH3o4j2AgV0NTuUsZIcyT5JduLMo0",
"filename": "ardata.pdf",
"size": 506898,
"type": "application/pdf",
"thumbnails": {
"small": {
"url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/BScJEWFsxF5MOcZJ_i8RsA/tPGX0GFgMelnlR88d41c6DJDl9pa9E3M5y3123HrDPrj_A4Kocl2StxwBLsb3dFSZrwIjcw5csCI0DQZ5d_5vA/zI1iLniIAYc-oZhHonurjZ7h2pu5vxlzbR1NE2DMTx0",
"width": 28,
"height": 36
},
"large": {
"url": "https://v5.airtableusercontent.com/v1/20/20/1694980800000/QMvnNnH-ftXXOeBRYJk4fQ/io3cJpspKnpka1q-fAhsEiJCwPmReOow7NeGPl96ogH5HFXtqxQvge85hcz3EIzPJ7rr0-uzdAUoDWfPi4a0mw/mpJk4qmSys6QE8IqSp4whgrc-52UiOR1DD9X7rIiTh8",
"width": 396,
"height": 512
}
}
}
]
For an image, this:
[
{
"id": "attwPzAABrX7mUpdl",
"width": 4032,
"height": 3024,
"url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/-Ex6TxOvcFa-QRIDtsSZwg/kZ2Id34JwYwhYwqWUbhlqMLntfqEbe8J95Gnb1eXswOMlCh8lvxIujNMOojCSCJ8l5ZMO1UKsPPNZSLusJHcja8tkPPBnrBtvEpfof_ti6c/2UH1gYdWPHDAKGY1qe97mUSxskh2Q2uPKh_79MRdXmI",
"filename": "2023-05-24 18.54.28.jpg",
"size": 5237935,
"type": "image/jpeg",
"thumbnails": {
"small": {
"url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/4vruo5I_9EIlxlN7KPLozg/TFof3GLJo3MMxVLeEYqYAL7-g8RHHpD0mn3KzJB-ffPvKV9KXg8wVk_2onpLARuugA87bCEwLdhqMthHFoFr4g/6dCCAksO2neaFKqtEu5Tws0-CRuXrm8OP21XlHWYTFM",
"width": 48,
"height": 36
},
"large": {
"url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/Sjt3pZJeeovYOiVtWEBuMQ/VJx0IHFStQ1CEvOKsaoeMixItvcw3gVQ-MxYDt60q98S9iAOiSxFmw47tUdd-ozsnPJ-bjOAJ8NxOb2eLcLdyQ/EKsuXWiXUzmiFedTTUmnx46s9JFnZXkq2UXcENSPyT8",
"width": 683,
"height": 512
},
"full": {
"url": "https://v5.airtableusercontent.com/v1/20/20/1694937600000/uQVk3ZjVPk42MXgkdZ3Cjw/vFhGv1qIgRKSfC8Dah5TRgUqNSFxGW0UUc82KhUFjIt62kyT8Qr173dK91Y7R_keQEohFm25jgdWODR4LvVKMg/4yrXgtkH8nwhC4EsQ0oaQxrseOIQowDsRtOt0HbIRFA",
"width": 3000,
"height": 3000
}
}
}
]
What filename should I use?
Two options: there's the id
field - which is guaranteed to be unique, but isn't very nice to read. And there's the filename
field which might be a duplicate.
Maybe the answer is to store them like this:
attwPzAABrX7mUpdl/2023-05-24 18.54.28.jpg
Or like this:
attwPzAABrX7mUpdl-2023-05-24 18.54.28.jpg
I'm leaning towards the latter - the airtable ID, then a hyphen, then the filename.
Are attachments unique to their tables or bases? Probably, but I can't be 100% sure.
Maybe the CLI grows a --attachments folder/
option which saves all attachments to the specified folder.
Or it could be a flag, so you just pass --attachments
and it automatically creates output-folder/attachments/
and saves them in there.
I think I like the latter option better. I'll build that first.
I'm going to download the attachments as a second step, with a progress bar based on the total count of bytes to fetch.
Should I fetch the thumbnails too? I think so - even though that will slightly throw off the progress bar since I won't have the file size of those (it's not in the JSON).
I'll store those as attwPzAABrX7mUpdl-full-2023-05-24 18.54.28.jpg
to avoid anything where filenames confuse things. I guess .jpg
is the right extension for all thumbnails?
I'll use $airtable_id-original-%filename
for the original to namespace it separately from the thumbnails.
Huh, no at least one of the PDF thumbnails turned out to be a PNG. I guess I should set the file extension based on the detected content type.
One slightly tricky thing is that the YAML/JSON flat files may end up with URLs to attachments which don't directly correspond to the files on disk.
Not sure the best way to fix that.
The stored YAML looks like this:
- Attachments:
- filename: 2022-07-26 11.13.58.jpg
height: 3024
id: atttqayAwklSIehzC
size: 5471198
thumbnails:
full:
height: 3000
url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/LhLoyhSixZRtUMOIW4rDmQ/diboUo14Z7KGQWOvyTPvG11qQ4FtV-xtDLDMXdETthmjqdM1Aysp-Gwa1f9lvR8ojVSURy1pw1VaL9SEDLQWxQ/_30sml0SbUmdNgkW9moBIQX9xSBVGMG6iLbCvSwIhgE
width: 3000
large:
height: 512
url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/_afO19TInFwIhO4-QVV-FQ/S-JgJs8MHCKUbV8scsZE4tZKwTrCl1yxq5KCFe0O97yDnSO-4aKqPKvDrv2NX0JRjeD4ecwhgnJJepTyF3BHQg/_G1mqoQr68W8-3QYo-9f2X2eiDzcDBumSDV_7DlerEM
width: 683
small:
height: 36
url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/17No0cBf3trPUv_1wywegw/360f3uc0uxEj5BCu0pqb4l7RrR6LgYy4RDLImLyhCxgbFQ6g2Ztss-yU1ouxU5BqTpqFP5tI0OWM6PhtnZdlFw/XUp16QgEVA0RvjU94Lm3HZYre_l0a83WxzPAmead57o
width: 48
type: image/jpeg
url: https://v5.airtableusercontent.com/v1/20/20/1694930400000/uZehbrwpqTtVT3f01w7cyQ/z6MGnx2f-cmr8Mrkwnj9m74ImZUwBUDOkYLPf9aAWhAhpEa6JLh7ydwJU_aDvU7B0nyi0_kfP1NERTq0IQz3eOYgrjCJh-z593y8WAzLawo/WuDrxvXaLSmGtf44yeduCy3pbsuwc-YTsvKGEsXpaaE
width: 4032
I could rewrite this to insert _saved: filename
next to each of those url
fields.
So I think the full process is:
_saved
keys pointing to those filenames.I don't particularly want to have to round-trip the YAML/JSON formats, so I think I'll keep all that data in memory, snapshot it to disk, then do the attachment downloads and snapshot each file to disk again with the extra fields after the attachments have been saved.
Better option: do this on a per-table basis - so it loads a table into memory, snaps it to disk, saves any attachments, updates the snapshot on disk and then moves on to the next table.
I still like the early snapshot because I like that the user won't lose data if they cancel the operation during a large attachment download.
👋 Hi @simonw - thank you for this great repo.
I was wondering if you'd be open to a PR for downloading attachments. I have a pretty simple MVP in this commit that works by downloading files after downloading all records. WDYT?
It's slightly hacky in that it does not rely on --schema
also being called and knowing exactly which fields are truly attachment fields, but this keeps the code quite a bit simpler.
https://support.airtable.com/docs/airtable-attachment-url-behavior
This tool should grow a feature to download and save attachments, since the URLs are no longer a useful long-term reference.
How long until they expire? The docs say: