prior-art-archive / migration-2019

Scripts related to the migration of v1 to v2
MIT License

Identify the uploads linked to no discernible Organization #14

Open reefdog opened 5 years ago

reefdog commented 5 years ago

The /uploads directory of the assets.priorartarchive.org S3 bucket has directories (19 total) keyed by Organization.id.

Of these directories, 11 can be linked to organizations, but eight can't. I even checked them against the v1 database, and they don't exist there either.

| directory | v2.Organization.id | v1.Company.id | slug |
| --- | --- | --- | --- |
| 1b5186be-eb68-4129-9795-6983198760ac | 1b5186be-eb68-4129-9795-6983198760ac | | prod-test-account |
| 4b096527-aded-4ce6-9cf1-85a22c3d3ff5 | 4b096527-aded-4ce6-9cf1-85a22c3d3ff5 | ffca7fb0-e96c-4336-9687-303e1115abff | cisco |
| 70f25e84-388b-4f0c-8c46-18ac6d32aeaa | 70f25e84-388b-4f0c-8c46-18ac6d32aeaa | 2039bbf8-6112-4274-a7c3-4dd2c187a036 | msoftadmin |
| a4098829-461e-4903-b121-101d50af67af | a4098829-461e-4903-b121-101d50af67af | 397ecfaf-7c1c-4387-83da-81eaf9bfbbb3 | delladmin |
| e52c43ef-c4bd-4ff1-a8c2-a392dbc90a95 | e52c43ef-c4bd-4ff1-a8c2-a392dbc90a95 | | xeroxadmin |
| ae69945c-8ec2-4698-8766-6ef15abfb7be | ae69945c-8ec2-4698-8766-6ef15abfb7be | | magic-leap |
| 031242a6-e323-4f7d-b091-f0bb5fbd3ed2 | 031242a6-e323-4f7d-b091-f0bb5fbd3ed2 | | msj |
| f1aca22d-777e-4025-ae8d-b76159303310 | f1aca22d-777e-4025-ae8d-b76159303310 | | joelgustafson |
| f7dd5f57-2e03-4f9f-8848-64dddf7a9b9f | f7dd5f57-2e03-4f9f-8848-64dddf7a9b9f | | jinjoolee |
| a7c2498e-fdda-4767-ad10-c80505ab9fcf | a7c2498e-fdda-4767-ad10-c80505ab9fcf | | bjoshi |
| 5574d994-d9b9-4a65-90bd-f1d6be9b4c63 | 5574d994-d9b9-4a65-90bd-f1d6be9b4c63 | | leviton |
| 096e402e-b66d-460a-a503-8fc5bd9524f6 | | | |
| 4872e7bc-cea5-4e8d-abcf-a20f7905ed1b | | | |
| 685c745f-d0c0-4bc0-bf0d-02dc77d47674 | | | |
| b260f099-4698-4f2f-84bf-7637db9a5d0d | | | |
| b74bcfc5-5029-444d-bb3f-06597a056cfd | | | |
| d3118d8c-ae60-4a56-8781-486aa59a3f1d | | | |
| d32a7b3c-6310-42c8-be70-bb8796920cf8 | | | |
| e166c716-ad8f-40b9-9229-b7262cbc378b | | | |

We need to sort out what these are. I've generated a complete recursive list of their contents. (Note that one directory contains three subdirectories, each holding only one file.)
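For anyone wanting to reproduce the check, the comparison boils down to a set difference between the directory names under /uploads and the known IDs. Here's a minimal sketch, assuming the directory names and the v2.Organization.id / v1.Company.id values have already been pulled into plain lists; the function name `orphan_directories` is hypothetical and not part of the migration scripts.

```python
def orphan_directories(upload_dirs, organization_ids, company_ids=()):
    """Return upload directory names that match no known organization or company ID.

    All three arguments are plain iterables of strings; how they are fetched
    (S3 listing, database query) is left out of this sketch.
    """
    # Normalize case so UUIDs compare equal regardless of how they were exported.
    known = {i.lower() for i in (*organization_ids, *company_ids)}
    # S3 "directories" are often listed with a trailing slash; strip it.
    names = {d.strip("/").lower() for d in upload_dirs}
    return sorted(names - known)


# Two directories taken from the table above.
dirs = [
    "1b5186be-eb68-4129-9795-6983198760ac/",  # prod-test-account
    "096e402e-b66d-460a-a503-8fc5bd9524f6/",  # no matching row
]
v2_ids = ["1b5186be-eb68-4129-9795-6983198760ac"]

orphans = orphan_directories(dirs, v2_ids)
print(orphans)
```

Passing the v1 Company IDs as the third argument covers the second check (against the v1 database) in the same call.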

reefdog commented 5 years ago

@metasj asked for a more readable list of the problematic files, so here's a Gist with a table of all the files, each already linked up, along with its timestamp and byte size. Also, here's an XLSX and a zipped CSV for good measure.

(Each file path is implicitly rooted at s3://assets.priorartarchive.org/uploads.)

metasj commented 5 years ago

I don't see the directory names in the gist -- is that all files from all 8 directories, combined? The directory seems like the most important piece for debugging.

Most are Cisco files, by inspection; a few are test uploads.

reefdog commented 5 years ago

@metasj The directory names are built into the path name. E.g., 096e402e-b66d-460a-a503-8fc5bd9524f6/01549568245194.pdf is the file 01549568245194.pdf within the directory 096e402e-b66d-460a-a503-8fc5bd9524f6. You can also see the eight directories in the table above; they're the last eight rows, the ones with no corresponding v2.Organization.id or v1.Company.id.

(I'll go ahead and edit the Gist so the directory is its own column though, just for clarity!)
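If anyone wants to do the same split on the CSV themselves, here's a rough sketch. It assumes each path is relative to /uploads as noted earlier; `split_upload_path` is a hypothetical helper name, not something from the repo.

```python
def split_upload_path(path):
    """Split 'directory/rest/of/key' into (directory, remainder).

    Only the first path segment is treated as the directory, so files in
    nested subdirectories keep the rest of their path in the remainder.
    """
    directory, _, remainder = path.strip("/").partition("/")
    return directory, remainder


directory, filename = split_upload_path(
    "096e402e-b66d-460a-a503-8fc5bd9524f6/01549568245194.pdf"
)
print(directory)  # the directory (would-be Organization.id)
print(filename)   # the file within it
```

Splitting on the first slash only keeps nested files grouped under their top-level directory, which matters for the one directory that contains subdirectories.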

metasj commented 5 years ago

Got it, just hard to parse. We should have a username for every account. These are perhaps users who didn't set an organization.

24f6: me
ed1b: Joel?
7674: Travis?
5d0d: cisco file + title tests
6cfd:
3f1d:
0cf8: travis
378b: cisco test?
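The four-character labels above appear to be the trailing characters of the eight unlinked directory names. A small sketch for expanding them back to full names, assuming that reading is right; `resolve_suffix` and `ORPHAN_DIRS` are hypothetical names used only for illustration.

```python
# The eight directories with no matching v2.Organization.id or v1.Company.id.
ORPHAN_DIRS = [
    "096e402e-b66d-460a-a503-8fc5bd9524f6",
    "4872e7bc-cea5-4e8d-abcf-a20f7905ed1b",
    "685c745f-d0c0-4bc0-bf0d-02dc77d47674",
    "b260f099-4698-4f2f-84bf-7637db9a5d0d",
    "b74bcfc5-5029-444d-bb3f-06597a056cfd",
    "d3118d8c-ae60-4a56-8781-486aa59a3f1d",
    "d32a7b3c-6310-42c8-be70-bb8796920cf8",
    "e166c716-ad8f-40b9-9229-b7262cbc378b",
]


def resolve_suffix(suffix, directories=ORPHAN_DIRS):
    """Return the single directory name ending in `suffix`.

    Raises ValueError when the suffix matches zero or several directories,
    so an ambiguous or mistyped label is never silently resolved.
    """
    matches = [d for d in directories if d.endswith(suffix.lower())]
    if len(matches) != 1:
        raise ValueError(f"{suffix!r} matched {len(matches)} directories")
    return matches[0]


print(resolve_suffix("24f6"))
print(resolve_suffix("0cf8"))
```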