scientist-softserv / adventist_knapsack

Apache License 2.0
1 stars 0 forks source link

PDF.js not loading PDFs prior to splitting #180

Closed KatharineV closed 1 month ago

KatharineV commented 2 months ago

After deploying Knapsack, ADL and SDAPI are not loading PDF works in the PDF.js viewer prior to splitting. The PDF.js viewer should display PDFs immediately, and the UV should display them after relationship jobs and splitting run later.

This bug is affected works uploaded before and after the cutover to Knapsack. Works uploaded before Knapsack are showing a gray box on the page where PDF.js should render, with an error message instead of the PDF. Works uploaded after Knapsack are showing the PDF icon and no viewer at all.

Production has this error. Staging looks like it hasn't cut over the Knapsack, based on the appearance of the dashboard. So I haven't testing staging as much.

Old Works not currently showing PDF.js (uploaded 2022 to the present):

ADL prod: https://adl.b2.adventistdigitallibrary.org/concern/published_works/a8d40de8-a116-4bd5-890a-98edf027b44d Image

New Works not currently showing PDF.js (works uploaded 4/11 and 4/12): The ADL production example was imported from the OAI set. The SDAPI prod example was uploaded with CSV + Files using Bulkrax.

SDAPI prod: https://sdapi.b2.adventistdigitallibrary.org/concern/journal_articles/dialogue_2023_1_9_13_a_sower_went_out_to_sow Image

ADL prod: https://adl.b2.adventistdigitallibrary.org/concern/published_works/20186590_the_worker_s_bulletin_january_15_1918 Image

ShanaLMoore commented 2 months ago

Locally created an Image work type and attached a pdf file:

Image

Shouldn't we be hitting iiif print? in the stack trace. that iiif_viewer? doesn't call iiif_media as shown in the error above.

Image

ShanaLMoore commented 2 months ago

After fixing the above error (should we create a separate ticket?), this PR may solve this issue.

ShanaLMoore commented 2 months ago

Oddly, after a few page refreshes I also see this error.

Putting my debugger in Hyku::WorkShowPresenterDecorator#video_embed_viewer? isn't stopping, as if the decorator file isn't getting picked up. Hyrax::ImagePresenter < Hyku::WorkShowPresenter which is what the knapsack should be decorating.

Image

kirkkwang commented 2 months ago

@KatharineV the three examples you provided are now displaying the viewer. We had to reindex the file sets. Still not 100% sure what happened but it could have been just bad timing. I feel that if we initiate a reindex of everything, it'll fix a lot of these issues though that might be overkill since we still don't know if it's just a bad timing thing isolated to the date of the cutover to Knapsack.

KatharineV commented 2 months ago

Team, I see improved functionality with PDF.js, but we aren't back to normal yet. I see a work uploaded on 4/16 is displaying the PDF.js viewer, but the progress bar spins for several minutes without (so far) displaying the PDF.

https://adl.b2.adventistdigitallibrary.org/concern/published_works/22266526_cardinal_2015

Image

This is a work that has not split, so I expect it to load quickly and be viewable in the PDF.js viewer. That behavior is not happening as expected.

KatharineV commented 2 months ago

For comparison, here's a work that was uploaded several years ago (12/21), and it behaves as expected.

https://adl.b2.adventistdigitallibrary.org/concern/published_works/20122750_the_home_missionary_december_1_1889

The PDF.js viewer loads the document almost instantaneously, and it can be fully read.

KatharineV commented 2 months ago

Team, the behavior on the SDAPI tenant is behind ADL. The PDF.js viewer is still not loading on SDAPI, and neither is the UV. Yikes! Thanks for helping us fix this. Here's a sample work that doesn't show any in-browser viewer, despite the Items list including the full PDF and the split child works:

https://sdapi.b2.adventistdigitallibrary.org/concern/journal_articles/dialogue_2023_2_18_20_hearing_god_above_the_headlines

Image

Image

ShanaLMoore commented 2 months ago

Noting that the file in question is very large, over a gig in size. Wondering if that's part of the issue. digging in 👀

kirkkwang commented 2 months ago

This one looked like a reindex issue again, i reindexed this particular work to get it to show but we'll likely need to do a file set reindex to hit all of them

image
ShanaLMoore commented 2 months ago

@kirkkwang that's a great find but I wonder what's causing this. We've seen it multiple times now with newly added works. Why isn't the indexing happening?

kirkkwang commented 2 months ago

@ShanaLMoore this work was ingested on the 12th, that was the cutover day right? I wonder if that timing had something to do with it.

ShanaLMoore commented 2 months ago

This still seems like it's an issue. If the PDF.js is supposed to render while we wait on the split to switchover to the UV, here is what I'm seeing: (a blank UV while we wait for the children's jobs to complete)

Image

ShanaLMoore commented 2 months ago

Disregard my previous comment. The PDF.js rendering first is only configured to happen with remote urls from s3 (ie: this was set up specifically as a workaround for the preprocessed files of the mass migrations.) Manually created works are not expected to render the PDF.js first, per Kirk and LaRita.

cc @KatharineV Could you confirm if this is still an issue for you?

KatharineV commented 2 months ago

@ShanaLMoore Yeah! I do recall the PDF.js workaround being implemented specifically to account for the mass migration failing to split fast enough for the UV. However, the PDF.js has been working (pre-Knapsack) for all files uploaded to our sites. Not just new S3-linked uploads. And based on my testing today, PDF.js is indeed working in that more expansive capacity--it is rendering everything we need it to render on production right now. So, thanks for the work on this ticket, even if it crept beyond the scope of what LaRita and Kirk intended for PDF.js in the beginning! The PDF.js band-aid is actually much more useful than expected. It's providing a critical service, and it's working again as it was prior to Knapsack.

Here's the tl;dr of functionality I'm seeing and that I'd like to keep, regardless of the intentions of the workaround...

PDF.js works for new works uploaded via the UI (example: uploaded yesterday) or CSV/zip files (example: uploaded today). All PDFs render in the PDF.js before they split and display in the UV. Files uploaded prior to the team installing a PDF viewer in our application are now rendering in PDF.js, since they won't split until we tell them to (example: one of the first works loaded to Hyku back in 2021). The way the behavior has been working from installation of PDF.js to Knapsack is more fully realized than the behavior I saw when I created this ticket. We have that full functionality back in place as of my testing today. I understand if the intention was more narrow, but the functionality we received in the end is greater, and I hate to see any part of this platform move backwards in functionality. We need a way to read every PDF in the repository or we can't launch this site, and the work you've done to get PDF.js to where it is today has solved that problem. I hope it's alright to attach our hopes to keeping this functionality exactly as it is, at least! haha

Screenshot showing PDF.js for the work uploaded today via files & CSV, which demonstrates that PDF.js is working for more than S3 links via OAI: image