ualbertalib / jupiter

Jupiter is a University of Alberta Libraries-based initiative to create a sustainable and extensible digital asset management system. This is phase 2 (Digitization).
https://era.library.ualberta.ca/
MIT License

Items/Theses with file attachments that are Fedora error messages #2975

Open pgwillia opened 2 years ago

pgwillia commented 2 years ago

These items might

We are looking to identify any items or theses that may have this issue and create a report with at least the id, title and url.

If we can figure out a pattern to match, this might be a good second-round question for https://docs.google.com/document/d/1kjBhKqekIuH4VD_FFz1B668gumrJiHluoMh8fEh4Ktc/edit#heading=h.2fir5v5sus5t

Originally posted by @pgwillia in https://github.com/ualbertalib/digital-preservation/issues/45#issuecomment-1249716960

Related: https://github.com/ualbertalib/digital-preservation/issues/45 and https://github.com/ualbertalib/jupiter/issues/2043

ConnorSheremeta commented 1 year ago
blob_id(checked 222749),object_uuid(empty if orphan),file_name,file path,item path
4f1ac054-21e4-463f-94fb-86b56d3b5d52,34a05341-65e6-4732-a341-134559e8475e,8710531-NL22923.pdf,https://era.test.library.ualberta.ca/items/34a05341-65e6-4732-a341-134559e8475e/view/975d9c55-2b63-4730-a096-075568efb797/8710531-NL22923.pdf,https://era.test.library.ualberta.ca/items/34a05341-65e6-4732-a341-134559e8475e
80b30020-6786-4450-a070-b1404a860f8a,96b4e2de-e0da-4225-826a-550d310da6d1,WorkflowEngine.pdf,https://era.test.library.ualberta.ca/items/96b4e2de-e0da-4225-826a-550d310da6d1/view/e8996143-f1de-4255-aa69-9f793eb4717d/WorkflowEngine.pdf,https://era.test.library.ualberta.ca/items/96b4e2de-e0da-4225-826a-550d310da6d1
d436634d-dd40-49b9-ad2b-4439d1e06487,9dd00e26-85a9-4baa-8a46-59f82a821b18,Monica%20Fraser.pdf,https://era.test.library.ualberta.ca/items/9dd00e26-85a9-4baa-8a46-59f82a821b18/view/e01d30e1-adaa-48da-ac7b-32bc70f5047c/Monica-20Fraser.pdf,https://era.test.library.ualberta.ca/items/9dd00e26-85a9-4baa-8a46-59f82a821b18
e1aee2ac-69ca-4fcd-bbe3-17f3cccb55ab,66cf48a0-f6a7-4429-9349-7507a5475953,MM64875-MM94917.pdf,https://era.test.library.ualberta.ca/items/66cf48a0-f6a7-4429-9349-7507a5475953/view/3d484bad-4a13-43c8-9605-8acbe2783c84/MM64875-MM94917.pdf,https://era.test.library.ualberta.ca/items/66cf48a0-f6a7-4429-9349-7507a5475953
28a60087-6956-4bee-b247-b804a405c251,,Zhenhua_Li-PhD_thesis_-_submission.pdf
78c401bb-004e-4433-b544-dcb0e2a276c1,,Monica%20Fraser.pdf
c9b4bfe5-ccb9-4674-932c-8a420d1895bf,,scan.pdf
ecd2b144-d74b-46d3-8429-91c6b82b40bf,,WorkflowEngine.pdf

(era_fedora_file_issues.txt)

These are the problematic files I have found so far: the blob's MIME type (its content_type) states 'application/pdf', yet the associated file does not have a '.pdf' extension. The second half of the list are orphans (no object_uuid) and would be cleaned up by the rake task that garbage-collects orphan blobs. Next, I will loosen the parameters to hopefully catch more cases by cross-checking every blob's content_type against its expected extension through a look-up table.
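
For anyone following along, here is a rough sketch of what a look-up-table cross-check could look like. It is written in Python purely for illustration and is not the code used in Jupiter. The table contents, the function names (`extension_of`, `is_mismatch`, `find_mismatches`) and the sample rows are all assumptions made up for this example. A real run would read `content_type` and the file name from the blob records instead.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote

# Hypothetical look-up table: content type -> extensions we would accept.
# The real table would need to cover every content_type found in the blobs.
EXPECTED_EXTENSIONS = {
    "application/pdf": {".pdf"},
    "text/plain": {".txt"},
    "text/html": {".html", ".htm"},
    "image/jpeg": {".jpg", ".jpeg"},
}


def extension_of(file_name):
    """Return the lower-cased extension of a (possibly URL-encoded) file name."""
    return PurePosixPath(unquote(file_name)).suffix.lower()


def is_mismatch(content_type, file_name):
    """True when the extension is not one we expect for this content type."""
    expected = EXPECTED_EXTENSIONS.get(content_type)
    if expected is None:
        # Content type missing from the table: we cannot judge, so do not flag.
        return False
    return extension_of(file_name) not in expected


def find_mismatches(blobs):
    """Keep only the blob rows whose content type and extension disagree."""
    return [b for b in blobs if is_mismatch(b["content_type"], b["file_name"])]


# Made-up sample rows (not taken from the report above).
SAMPLE_BLOBS = [
    {"blob_id": "example-1", "content_type": "application/pdf", "file_name": "thesis.pdf"},
    {"blob_id": "example-2", "content_type": "text/html", "file_name": "scan.pdf"},
    {"blob_id": "example-3", "content_type": "application/pdf", "file_name": "Some%20Name.pdf"},
]

FLAGGED = find_mismatches(SAMPLE_BLOBS)
```

With the sample rows, only `example-2` ends up in `FLAGGED`, since its content type and extension disagree. Content types missing from the table are skipped rather than flagged, which keeps the report free of noise but means the table has to be reasonably complete before the results can be trusted.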