scientist-softserv / adventist_knapsack

Apache License 2.0

SDAPI: UV not displaying PDFs #341

Open KatharineV opened 1 year ago

KatharineV commented 1 year ago

On the SDAPI tenant (production), PDFs attached since 7/17/23 have never rendered in the UV. Jobs are running on /jobs, but I can't tell what I'm looking at, and I'm curious whether the jobs should ever be this backed up. We have ten days' worth of PDFs that have been attached but aren't rendering. You can see example works by visiting this search. Every work with a PDF attached, back to page 25 of the search, shows a thumbnail but nothing where the UV should be. There is just an icon.

Example of a work where the PDF was added on 7/17: https://sdapi.b2.adventistdigitallibrary.org/concern/journal_articles/13163964_coming_of_age_learning_disabilities_at_the_postsecondary_level?locale=en

Example of a work where the PDF was added on 7/26: https://sdapi.b2.adventistdigitallibrary.org/concern/journal_articles/13416725_profile_2004_k_12_teacher_perceptions_of_adventist_curriculum?locale=en

NOTES

maybe helpful/related? https://github.com/scientist-softserv/adventist-dl/blob/3f8026104e7015071741de989aefe6976eac16b0/app/views/hyrax/base/_representative_media.html.erb

ShanaLMoore commented 1 year ago

I don't think it's displaying because the work doesn't look like it got split by iiif_print 🤔 or perhaps they were split but didn't get relationships formed.

I pulled the prod sha locally and confirmed that splitting is happening.

laritakr commented 1 year ago

Notes (since the repo is not responding and I can't currently work on this):

Unsure how to track where the creation failed for the child works.

ShanaLMoore commented 1 year ago

The job is supposed to create those works. There's an exception happening that is not bubbling up, so it "failed" even though there's no trace of it.

Retries need to report their failures to the parent.

It was successful when deleting the file set, rebuilding an entry, and reindexing the file set.
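
A minimal sketch of the "report failures to the parent" idea, in plain Ruby with no Rails, GoodJob, or iiif_print dependency. `ParentRecord`, `run_child_step`, and the step name are hypothetical names used only for illustration; the real jobs may be structured quite differently. The point is that the exception is recorded somewhere visible and then re-raised, rather than swallowed.

```ruby
# Hypothetical stand-in for the parent work's bookkeeping.
# This is NOT a real Hyrax or iiif_print class.
class ParentRecord
  attr_reader :failures

  def initialize
    @failures = []
  end

  # Keep a trace of what failed so it can be inspected later.
  def record_failure(step, error)
    @failures << { step: step, error_class: error.class.name, message: error.message }
  end
end

# Runs one child-creation step. Any exception is recorded on the parent
# and then re-raised, so the job backend still sees the failure and can retry.
def run_child_step(parent, step)
  yield
rescue StandardError => e
  parent.record_failure(step, e)
  raise
end

parent = ParentRecord.new
begin
  run_child_step(parent, "create_child_page_3") { raise ArgumentError, "missing page image" }
rescue ArgumentError
  # In a real job the backend would schedule the retry at this point.
end

puts parent.failures.inspect
```

With this shape, a failed step leaves an entry in `parent.failures` even when a later retry succeeds, so there is always a trace to follow.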

jillpe commented 1 year ago

To Do:

To remediate:

ShanaLMoore commented 1 year ago

a script like this may be helpful to correct the bad data

laritakr commented 1 year ago

Edit: the jobs ran correctly, so the details below are not an issue.

~I noticed a potential issue that could affect anyone using GoodJob:~

~The GoodJob prioritization seems to be acting strangely and COULD be part of why works sometimes need to get reindexed or saved after PDF splitting. The CreateRelationshipsJob priority (-20) is lower than AttachFilesToWorkJob (-1), yet it is in the queue above it. And if the relationships are attempted before the files are attached, it causes problems, which is WHY we set priorities as we did.~

~It is possible that the gem only applies the priority at the point where the jobs are moved to active, but this should be investigated to be certain, as other jobs are being scheduled and appear above the relationships job immediately.~

![Screenshot 2023-08-08 at 12 44 10 PM](https://github.com/scientist-softserv/adventist-dl/assets/17851674/9d10a5f2-ee83-40f1-bf0a-7e29c96bfeaf)

laritakr commented 1 year ago

Job started 8/11/23 at 10:45 am EDT.

Image

laritakr commented 1 year ago

The job ended without any error but apparently did not complete, as the logs disappeared. However, it doesn't seem to handle the situation needed for these works. These PDFs were attached separately, after the original ingest.

Reingesting from the Bulkrax entry apparently does not attach the remote PDF as a file set, so running it over these works would just remove the PDF and not reattach it. At this point, the file sets seem to split when they are added, so there are a few options to handle these:

1. Delete the PDF file set and re-upload it to each work again manually.
2. Come up with a process to split an already-uploaded work (a longer-term goal, but definitely not something we can do quickly).
3. Find a way for Bulkrax to create the file_set from the remote URL for these. It appears that Bulkrax now has remote files, and that seems like it would be possible.

KatharineV commented 1 year ago

You're absolutely right about the Bulkrax upload for works in this OAI set. The remote files for SDAPI were meant to only load as a link, not attach the file. If it could be reconfigured to actually attach the PDFs that are referenced by the remote file field in the OAI feed, we would be very interested in that (unless it removes PDFs attached through the UI--see below). I can also see us paying for development hours toward option 2. If we could run something that splits already-uploaded works, that would be extremely helpful for our ADL tenant.

I'm intrigued to learn that running Bulkrax over the existing works would remove a PDF we manually added after the fact. If that's the case, then we don't want to rerun Bulkrax, because we've been manually adding PDFs to several works on the SDAPI tenant, and we don't want to lose that effort. Would that behavior definitely be the case?

Thanks for considering these complexities!

laritakr commented 1 year ago

> I'm intrigued to learn that running Bulkrax over the existing works would remove a PDF we manually added after the fact. If that's the case, then we don't want to rerun Bulkrax, because we've been manually adding PDFs to several works on the SDAPI tenant, and we don't want to lose that effort. Would that behavior definitely be the case?

The PDF was removed by my job because that was needed to re-run the split. The files only split as a file is connected.

This removal is something unique to the job that I put together, both because it was necessary to rerun the split and because I assumed the file originally came from the ingest and would get re-ingested. It is not the case for a normal reingest.

KatharineV commented 1 year ago

Team, perhaps this same issue is affecting the ADL tenant as well as SDAPI. There is a collection that we uploaded in mid-June, and most of the works have attached PDFs that didn't render. Some of the works seem corrupted and won't even load/open. Could it be a similar problem? If not, please let me know and I'll create a new ticket. It seems related so I thought I'd start here.

Collection with works where most of the PDFs didn't split and/or don't render in the UV (see any work from Box 1-4): https://adl.b2.adventistdigitallibrary.org/collections/b8693090-6973-4d0e-b05a-7cc1773a80d8?locale=en

Sample work with attached PDF that didn't split: https://adl.b2.adventistdigitallibrary.org/concern/generic_works/c305_b002_f02_box_2_fld_2_mar_to_apr_1921?locale=en

Sample work where the PDF didn't even attach (although I can verify that I tried to attach one as recently as 8/22 and when I tried to save I got a page error and the file has never appeared): https://adl.b2.adventistdigitallibrary.org/concern/generic_works/c305_b004_f02_box_4_fld_2_aug_to_sep_1923?locale=en

Sample of works that won't even load the work page in browser:

Child work that exists on the work page for one of the works above that won't even open in the browser: https://adl.b2.adventistdigitallibrary.org/concern/generic_works/73d619e3-03bb-4d52-a1e6-bb994890b49e?locale=en

Sample work where the PDF split but the child works aren't attached to the parent. They're visible on the dashboard Works page (example).

Something weird is definitely going on, and I'll be grateful for your help to straighten it up sometime.

Thanks!

laritakr commented 1 year ago

We have a ticket for cleaning up works in ADL... I had started it a while back and there are a number of different issues. Some were not fully indexed. Some split but didn't get EVERY page created so they didn't connect to the parent, leaving the stray pages out there. There were so many different issues, and I hadn't fully identified them all when I got pulled off to other work.
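
One way to start sorting out the "split but didn't get every page" cases is to compare the number of attached child works against the number of pages in the PDF. The sketch below is plain Ruby over made-up data; the `Work` struct, its fields, and the sample IDs are assumptions for illustration, and in practice the counts would have to come from the repository itself.

```ruby
# Hypothetical summary rows. In a real cleanup these values would be
# looked up from the repository rather than hard-coded.
Work = Struct.new(:id, :pdf_page_count, :child_ids, keyword_init: true)

# Returns the works whose number of attached children differs from
# the number of pages in the PDF, i.e. the split did not fully complete.
def incomplete_splits(works)
  works.reject { |w| w.child_ids.length == w.pdf_page_count }
end

works = [
  Work.new(id: "work-a", pdf_page_count: 3, child_ids: %w[a1 a2 a3]),
  Work.new(id: "work-b", pdf_page_count: 4, child_ids: %w[b1 b2]),
  Work.new(id: "work-c", pdf_page_count: 2, child_ids: [])
]

incomplete_splits(works).each do |w|
  puts "#{w.id}: expected #{w.pdf_page_count} children, found #{w.child_ids.length}"
end
```

A report like this only identifies candidates for cleanup. It does not distinguish between pages that were never created and pages that exist but were never attached to the parent.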

As far as I know, the majority of the issues with SDAPI would resolve if the PDF is removed and re-uploaded... the PDFs were uploaded during a time when the system wasn't responding normally. But in time, we will likely have a more automated way to reimport, so I'm not suggesting you do that much manual work at this point unless you choose to.

KatharineV commented 1 year ago

Thanks, @laritakr , that makes a lot of sense!