Identify the uploads linked to no discernible Organization

reefdog commented 5 years ago

The /uploads directory of the assets.priorartarchive.org S3 bucket has 19 directories, each keyed by Organization.id.

Of these directories, 11 can be linked to organizations, but 8 can't. I even checked those 8 against the v1 database, and they don't exist there either.

directory	v2.Organization.id	v1.Company.id	slug
1b5186be-eb68-4129-9795-6983198760ac	1b5186be-eb68-4129-9795-6983198760ac		prod-test-account
4b096527-aded-4ce6-9cf1-85a22c3d3ff5	4b096527-aded-4ce6-9cf1-85a22c3d3ff5	ffca7fb0-e96c-4336-9687-303e1115abff	cisco
70f25e84-388b-4f0c-8c46-18ac6d32aeaa	70f25e84-388b-4f0c-8c46-18ac6d32aeaa	2039bbf8-6112-4274-a7c3-4dd2c187a036	msoftadmin
a4098829-461e-4903-b121-101d50af67af	a4098829-461e-4903-b121-101d50af67af	397ecfaf-7c1c-4387-83da-81eaf9bfbbb3	delladmin
e52c43ef-c4bd-4ff1-a8c2-a392dbc90a95	e52c43ef-c4bd-4ff1-a8c2-a392dbc90a95		xeroxadmin
ae69945c-8ec2-4698-8766-6ef15abfb7be	ae69945c-8ec2-4698-8766-6ef15abfb7be		magic-leap
031242a6-e323-4f7d-b091-f0bb5fbd3ed2	031242a6-e323-4f7d-b091-f0bb5fbd3ed2		msj
f1aca22d-777e-4025-ae8d-b76159303310	f1aca22d-777e-4025-ae8d-b76159303310		joelgustafson
f7dd5f57-2e03-4f9f-8848-64dddf7a9b9f	f7dd5f57-2e03-4f9f-8848-64dddf7a9b9f		jinjoolee
a7c2498e-fdda-4767-ad10-c80505ab9fcf	a7c2498e-fdda-4767-ad10-c80505ab9fcf		bjoshi
5574d994-d9b9-4a65-90bd-f1d6be9b4c63	5574d994-d9b9-4a65-90bd-f1d6be9b4c63		leviton
096e402e-b66d-460a-a503-8fc5bd9524f6
4872e7bc-cea5-4e8d-abcf-a20f7905ed1b
685c745f-d0c0-4bc0-bf0d-02dc77d47674
b260f099-4698-4f2f-84bf-7637db9a5d0d
b74bcfc5-5029-444d-bb3f-06597a056cfd
d3118d8c-ae60-4a56-8781-486aa59a3f1d
d32a7b3c-6310-42c8-be70-bb8796920cf8
e166c716-ad8f-40b9-9229-b7262cbc378b

We should figure out what these are. I've generated a complete recursive list of their contents. (Note that one directory actually contains three more directories, each with only one file.)
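
The comparison behind the table above amounts to a set difference between the directory names and the known Organization.id values. A minimal sketch, assuming both lists are already in hand as plain strings (the function name `find_orphaned_dirs` is hypothetical and not part of the migration tooling; the sample IDs are taken from the table):

```python
# Hypothetical sketch: illustrates the comparison, not the actual migration code.

def find_orphaned_dirs(upload_dirs, organization_ids):
    """Return upload directory names with no matching Organization.id, keeping order."""
    known = set(organization_ids)
    return [d for d in upload_dirs if d not in known]


# Sample drawn from the table above: two linked directories and two unlinked ones.
upload_dirs = [
    "1b5186be-eb68-4129-9795-6983198760ac",
    "4b096527-aded-4ce6-9cf1-85a22c3d3ff5",
    "096e402e-b66d-460a-a503-8fc5bd9524f6",
    "4872e7bc-cea5-4e8d-abcf-a20f7905ed1b",
]
organization_ids = [
    "1b5186be-eb68-4129-9795-6983198760ac",  # prod-test-account
    "4b096527-aded-4ce6-9cf1-85a22c3d3ff5",  # cisco
]

orphans = find_orphaned_dirs(upload_dirs, organization_ids)
print(orphans)
```

Running the same check a second time against the v1 Company ids would cover the v1 lookup described above.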

reefdog commented 5 years ago

@metasj asked for a more readable list of the problematic files, so here's a Gist with a table of all the unlinked files, along with their timestamp and byte size. Also here's an XLSX and zipped CSV, for good measure.

(Each file path is implicitly rooted at s3://assets.priorartarchive.org/uploads.)

metasj commented 5 years ago

I don't see the directory names in the gist -- is that all files from all 8 directories, combined? The directory seems like the most important piece for debugging.

Most are Cisco files, by inspection; a few are test uploads.

reefdog commented 5 years ago

@metasj The directory names are built into the path name. E.g., 096e402e-b66d-460a-a503-8fc5bd9524f6/01549568245194.pdf is the file 01549568245194.pdf within the directory 096e402e-b66d-460a-a503-8fc5bd9524f6. You can also see the eight directories in the table above; they're the last eight rows, the ones with no corresponding v2.Organization.id or v1.Company.id.

(I'll go ahead and edit the Gist so the directory is its own column though, just for clarity!)
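
Pulling the directory out into its own column is a split on the first slash. A minimal sketch, assuming each path is relative to the uploads root as noted earlier (the helper names `split_upload_path` and `full_s3_uri` are hypothetical). It splits only on the first slash, so a file in a nested subdirectory keeps the rest of its path intact:

```python
# Hypothetical helpers for illustration; not taken from the Gist or the repo.

S3_ROOT = "s3://assets.priorartarchive.org/uploads"


def split_upload_path(path):
    """Split 'directory/rest/of/path' into (directory, rest) at the first slash."""
    directory, _, rest = path.partition("/")
    return directory, rest


def full_s3_uri(path):
    """Prefix a relative upload path with the bucket root."""
    return f"{S3_ROOT}/{path}"


example = "096e402e-b66d-460a-a503-8fc5bd9524f6/01549568245194.pdf"
directory, filename = split_upload_path(example)
print(directory)
print(filename)
print(full_s3_uri(example))
```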

metasj commented 5 years ago

Got it, just hard to parse. We should have a username for every account. These are perhaps users who didn't set an organization.

24f6: me
ed1b: Joel?
7674: Travis?
5d0d: cisco file + title tests
6cfd:
3f1d:
0cf8: travis
378b: cisco test?
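
The four-character labels above appear to be the trailing characters of the eight unlinked directory names. A minimal sketch for expanding a label back to its full directory, assuming the label is a unique suffix (the function name `resolve_suffix` is hypothetical):

```python
# Hypothetical sketch: maps a short suffix back to one of the eight unlinked directories.

UNLINKED_DIRS = [
    "096e402e-b66d-460a-a503-8fc5bd9524f6",
    "4872e7bc-cea5-4e8d-abcf-a20f7905ed1b",
    "685c745f-d0c0-4bc0-bf0d-02dc77d47674",
    "b260f099-4698-4f2f-84bf-7637db9a5d0d",
    "b74bcfc5-5029-444d-bb3f-06597a056cfd",
    "d3118d8c-ae60-4a56-8781-486aa59a3f1d",
    "d32a7b3c-6310-42c8-be70-bb8796920cf8",
    "e166c716-ad8f-40b9-9229-b7262cbc378b",
]


def resolve_suffix(suffix, directories=UNLINKED_DIRS):
    """Return the single directory ending in `suffix`; raise if none or several match."""
    matches = [d for d in directories if d.endswith(suffix)]
    if len(matches) != 1:
        raise ValueError(f"suffix {suffix!r} matched {len(matches)} directories")
    return matches[0]


print(resolve_suffix("24f6"))
```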
