pulibrary / figgy

Valkyrie-based digital repository backend.

Broken PDF Items #6366

Closed tpendragon closed 1 week ago

tpendragon commented 2 months ago

Tom Ventimiglia says the following:

I was looking at one of our Figgy MVWs recently, and noticed that there are still unprocessed files: https://figgy.princeton.edu/catalog/3931aa55-afd6-4c29-88bc-afaac7251f38 . There are a number of volumes with unprocessed images. However, it appears that in all these cases, the images are duplicates. There are two sets of identical images, and part of the second set is unprocessed. Simply deleting the duplicates fixes the problem. I started doing this but there are a lot. Trey thought that it might be possible to regenerate the image set from the PDFs, which would take care of the duplicates and the unprocessed images. Would this be possible? If so, I can provide a list of the affected volumes (17 in all)

https://figgy.princeton.edu/catalog/8e47ebd4-32a0-46a9-84ba-a5360b35f91b
https://figgy.princeton.edu/catalog/3c289f5b-2584-416b-90d0-9b6252897754
https://figgy.princeton.edu/catalog/79e504c4-70e1-4e18-ad69-ecb99868f587
https://figgy.princeton.edu/catalog/8aa5af1f-b6c9-48d8-8222-8f9634d8896b
https://figgy.princeton.edu/catalog/c27e01c8-e0f4-4865-9182-c5a145cca052
https://figgy.princeton.edu/catalog/9ab03dca-14bd-431c-ab3a-902d19cb4ac5
https://figgy.princeton.edu/catalog/e4353eec-ff2f-4f8a-8a3e-9cea466cb75f
https://figgy.princeton.edu/catalog/f2440b6f-bd8f-418c-81ff-13bf146f71fd
https://figgy.princeton.edu/catalog/0e1c7500-cc38-429e-9c28-336d0b73f72d
https://figgy.princeton.edu/catalog/7c617e34-a827-413b-ada7-151b327bcc70
https://figgy.princeton.edu/catalog/3b5f3f58-2a71-404d-acf5-3b8227e4c58d
https://figgy.princeton.edu/catalog/a66a0ec3-8413-48bd-ba48-44cc7bea0dd4
https://figgy.princeton.edu/catalog/3b850ae4-08be-4b85-bbaa-2d88b00ec740
https://figgy.princeton.edu/catalog/4708281f-734a-41d6-8324-c360b9fde8ac
https://figgy.princeton.edu/catalog/eaab6aff-1a83-4694-b817-a66adf9a0f4d
https://figgy.princeton.edu/catalog/4c09654f-b477-4d69-82cd-501fd9759d65
https://figgy.princeton.edu/catalog/28281766-a393-4539-bcc7-84968eea0024

I was looking at the other items in this series, and it turns out there are others with the image duplication issue, though they do not necessarily show the images as still processing. But they all gave a "Health Status" warning. Here is another list, which is in addition to the first one I sent:

https://figgy.princeton.edu/catalog/3d54455e-c823-4d97-94c5-985f7da0b41a
https://figgy.princeton.edu/catalog/47d480bd-cd31-4858-b4cd-0a2f3d3007be
https://figgy.princeton.edu/catalog/5344efb6-2ae2-4f68-bc58-b27b08748846
https://figgy.princeton.edu/catalog/493d5f33-f524-464b-8d7b-5819dbfe612f
https://figgy.princeton.edu/catalog/d67a3710-72d2-4811-9a58-911719e90863
https://figgy.princeton.edu/catalog/f86ab145-f8f8-43c7-8460-b5f7400e4005
https://figgy.princeton.edu/catalog/3d955684-1b03-4a67-827c-b920802f8404
https://figgy.princeton.edu/catalog/8d704b8a-6cf2-4a92-9095-ec2c77ad22df
https://figgy.princeton.edu/catalog/ceed9ac6-8279-4483-ac43-926a1df27912
https://figgy.princeton.edu/catalog/1a533e65-af8c-42f5-bbcc-2e8d951fa006
https://figgy.princeton.edu/catalog/6400006d-b0e2-46cd-85a1-24d89ab66457
https://figgy.princeton.edu/catalog/37764ee5-c16c-493e-8dd2-d52ab4aa85fd
https://figgy.princeton.edu/catalog/c518dc7c-0bfd-4870-992f-084adbc7c8cb
https://figgy.princeton.edu/catalog/56c2dc21-9478-405b-98dc-eb98e2c9c7b3
https://figgy.princeton.edu/catalog/6007d21e-5781-4cfd-9fba-dc47149a5bf1
https://figgy.princeton.edu/catalog/babc3840-c182-46a3-bcc3-f634ec4fbcf6
https://figgy.princeton.edu/catalog/76e2d55a-2950-4a4d-a0cc-300eb4e8824d
https://figgy.princeton.edu/catalog/ab84fe85-47ef-47db-a3b5-1bb0064da41f
https://figgy.princeton.edu/catalog/63d7d954-27a5-4416-a680-cbb1143c0791
https://figgy.princeton.edu/catalog/ffa5bf94-e588-4d90-8e61-e0728352e8b9
https://figgy.princeton.edu/catalog/1d89b2e4-81e7-41f5-821a-ddccf03f75d8
https://figgy.princeton.edu/catalog/88d4d854-0084-44bd-9959-125c62d16186
https://figgy.princeton.edu/catalog/a763ca2a-37ce-4ab7-afd2-040694d5e1ca
https://figgy.princeton.edu/catalog/250c76cc-36e5-46f6-86d6-72650d142136
https://figgy.princeton.edu/catalog/e541b0a2-a9a5-4156-b86e-7eaff9c427a3
https://figgy.princeton.edu/catalog/174834c7-0ba8-41a6-880c-680c52a502b5
https://figgy.princeton.edu/catalog/7ec25438-2f8a-4a4c-b7ff-f0ec5505067e
https://figgy.princeton.edu/catalog/8eca94cb-d750-4443-9b8f-feab23035189
https://figgy.princeton.edu/catalog/22c6a6ac-de6c-486f-a848-7659af98747f

Sudden Priority Justification

We're the only ones who can fix this, the regenerate derivatives job doesn't seem to work, and Tom's work is blocked.

Fix Options

We're going to ticket the project and revisit it later; for this sudden priority, we'll do the first option.

First Step

Pick one resource and write a script that removes all non-PDF members and then regenerates derivatives. Run it on that one resource and see if it fixes the problem.
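
The selection half of such a script could be sketched in plain Ruby as follows. Everything here is a hypothetical stand-in: the `Member` struct, its `mime_type` field, and `partition_members` are not Figgy or Valkyrie APIs. A real script would load the members from the repository and delete them through the application's own persistence layer.

```ruby
# Hypothetical stand-in for a file set attached to a volume; not a Figgy class.
Member = Struct.new(:id, :mime_type)

PDF_MIME_TYPE = "application/pdf"

# Split members into the source PDFs to keep and everything else to remove.
# Returns [pdf_members, non_pdf_members].
def partition_members(members)
  members.partition { |member| member.mime_type == PDF_MIME_TYPE }
end

# Made-up sample data: one source PDF plus a page image and its duplicate.
members = [
  Member.new("pdf-1", "application/pdf"),
  Member.new("page-1", "image/tiff"),
  Member.new("page-1-duplicate", "image/tiff")
]

keep, remove = partition_members(members)
puts "Keeping: #{keep.map(&:id).join(', ')}"
puts "Removing: #{remove.map(&:id).join(', ')}"
```

Separating the selection from the deletion makes it possible to review the list of members slated for removal on one resource before anything is destroyed.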

New First Step

Find a broken one and manually run the PDF derivative process on it and see what happens.

hackartisan commented 1 month ago

Here is the list of ids formatted to paste into a terminal:

["3931aa55-afd6-4c29-88bc-afaac7251f38",
"8e47ebd4-32a0-46a9-84ba-a5360b35f91b",
"3c289f5b-2584-416b-90d0-9b6252897754",
"79e504c4-70e1-4e18-ad69-ecb99868f587",
"8aa5af1f-b6c9-48d8-8222-8f9634d8896b",
"c27e01c8-e0f4-4865-9182-c5a145cca052",
"9ab03dca-14bd-431c-ab3a-902d19cb4ac5",
"e4353eec-ff2f-4f8a-8a3e-9cea466cb75f",
"f2440b6f-bd8f-418c-81ff-13bf146f71fd",
"0e1c7500-cc38-429e-9c28-336d0b73f72d",
"7c617e34-a827-413b-ada7-151b327bcc70",
"3b5f3f58-2a71-404d-acf5-3b8227e4c58d",
"a66a0ec3-8413-48bd-ba48-44cc7bea0dd4",
"3b850ae4-08be-4b85-bbaa-2d88b00ec740",
"4708281f-734a-41d6-8324-c360b9fde8ac",
"eaab6aff-1a83-4694-b817-a66adf9a0f4d",
"4c09654f-b477-4d69-82cd-501fd9759d65",
"28281766-a393-4539-bcc7-84968eea0024",
"3d54455e-c823-4d97-94c5-985f7da0b41a",
"47d480bd-cd31-4858-b4cd-0a2f3d3007be",
"5344efb6-2ae2-4f68-bc58-b27b08748846",
"493d5f33-f524-464b-8d7b-5819dbfe612f",
"d67a3710-72d2-4811-9a58-911719e90863",
"f86ab145-f8f8-43c7-8460-b5f7400e4005",
"3d955684-1b03-4a67-827c-b920802f8404",
"8d704b8a-6cf2-4a92-9095-ec2c77ad22df",
"ceed9ac6-8279-4483-ac43-926a1df27912",
"1a533e65-af8c-42f5-bbcc-2e8d951fa006",
"6400006d-b0e2-46cd-85a1-24d89ab66457",
"37764ee5-c16c-493e-8dd2-d52ab4aa85fd",
"c518dc7c-0bfd-4870-992f-084adbc7c8cb",
"56c2dc21-9478-405b-98dc-eb98e2c9c7b3",
"6007d21e-5781-4cfd-9fba-dc47149a5bf1",
"babc3840-c182-46a3-bcc3-f634ec4fbcf6",
"76e2d55a-2950-4a4d-a0cc-300eb4e8824d",
"ab84fe85-47ef-47db-a3b5-1bb0064da41f",
"63d7d954-27a5-4416-a680-cbb1143c0791",
"ffa5bf94-e588-4d90-8e61-e0728352e8b9",
"1d89b2e4-81e7-41f5-821a-ddccf03f75d8",
"88d4d854-0084-44bd-9959-125c62d16186",
"a763ca2a-37ce-4ab7-afd2-040694d5e1ca",
"250c76cc-36e5-46f6-86d6-72650d142136",
"e541b0a2-a9a5-4156-b86e-7eaff9c427a3",
"174834c7-0ba8-41a6-880c-680c52a502b5",
"7ec25438-2f8a-4a4c-b7ff-f0ec5505067e",
"8eca94cb-d750-4443-9b8f-feab23035189",
"22c6a6ac-de6c-486f-a848-7659af98747f"
]
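
A list like the one above can be derived from the catalog URLs rather than retyped. This is a minimal sketch, assuming the URLs follow the `https://figgy.princeton.edu/catalog/<uuid>` shape seen in this thread; `catalog_ids` is a hypothetical helper name, not part of Figgy.

```ruby
# Matches a Figgy catalog URL and captures the UUID at the end of the path.
# \h is Ruby's shorthand for a hexadecimal digit.
CATALOG_URL_PATTERN = %r{https://figgy\.princeton\.edu/catalog/(\h{8}-\h{4}-\h{4}-\h{4}-\h{12})}

# Return the unique resource IDs found in a block of pasted text,
# in the order they first appear.
def catalog_ids(text)
  text.scan(CATALOG_URL_PATTERN).flatten.uniq
end

pasted = <<~TEXT
  https://figgy.princeton.edu/catalog/8e47ebd4-32a0-46a9-84ba-a5360b35f91b
  https://figgy.princeton.edu/catalog/3c289f5b-2584-416b-90d0-9b6252897754
TEXT

puts catalog_ids(pasted).inspect
```

The `uniq` call guards against the same volume being pasted twice, so each resource would only be processed once.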

hackartisan commented 1 month ago

After deploying #6405, the regenerate derivatives job works as expected (I ran it on https://figgy.princeton.edu/catalog/8e47ebd4-32a0-46a9-84ba-a5360b35f91b). I've just enqueued the regenerate derivatives job for the first file set of each of the other resources listed here.
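
The shape of that batch run, one job per resource targeting its first file set, can be sketched as below. This is not the command that was actually run: `enqueue_first_file_sets` and both injected callables are hypothetical, standing in for the repository lookup and the job queue so that the sketch stays independent of Figgy's internals.

```ruby
# Enqueue one regeneration job per resource, targeting its first file set.
# Both callables are hypothetical and injected by the caller:
#   file_set_ids_for: resource ID -> ordered array of file set IDs
#   enqueue:          file set ID -> enqueues the job for that file set
# Returns the IDs of resources skipped because they had no file sets.
def enqueue_first_file_sets(resource_ids, file_set_ids_for:, enqueue:)
  skipped = []
  resource_ids.each do |resource_id|
    first_file_set_id = file_set_ids_for.call(resource_id).first
    if first_file_set_id
      enqueue.call(first_file_set_id)
    else
      skipped << resource_id
    end
  end
  skipped
end

# Made-up data standing in for repository lookups and the job queue.
file_sets = { "resource-a" => ["fs-1", "fs-2"], "resource-b" => [] }
queued = []
skipped = enqueue_first_file_sets(
  file_sets.keys,
  file_set_ids_for: ->(id) { file_sets.fetch(id) },
  enqueue: ->(file_set_id) { queued << file_set_id }
)
puts "Queued: #{queued.inspect}, skipped: #{skipped.inspect}"
```

Returning the skipped IDs gives a short list to inspect by hand instead of letting resources without file sets fail silently.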

hackartisan commented 1 month ago

The duplicate files issue appears to be resolved. However, the process that converts the PDFs to TIFFs seems to be generating a lot of blank pages, which are erroring on characterization, so the resources still aren't appearing correctly.

eliotjordan commented 1 month ago

tpendragon commented 2 weeks ago

Blocked on #6436

tpendragon commented 1 week ago

With #6436 in place, I regenerated everything and I think it worked, actually. Everything came through, and all of the above are green. I'm closing.