christinklez opened this issue 5 months ago (status: Open)
This object has a 1GB tiff file: https://nuxeo.cdlib.org/nuxeo/nxdoc/default/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/view_documents?conversationId=0NXMAIN
Do we want to support such enormous image files?
@barbarahui Nuxeo should support importing large TIFFs as the main content file -- but we've previously leaned on Nuxeo to generate a JPEG derivative to fetch. We somehow got a JPEG derivative to display in CSphere prod for this object (simple image Nuxeo object, from UCB Bancroft) the last time we harvested: https://calisphere.org/item/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/ . Did the derivative-generation process somehow fail?
That said -- @christinklez is this one we could just ETL for the time being?
@aturner we can't ETL this because it is coming from Nuxeo.
Hmm yeah, I think the legacy deep harvester grabbed the "medium" sized image from Nuxeo rather than the full-sized image. I'll take a look at the rikolti content harvester code to see what's going on.
It's also possible that the legacy deep harvesting infrastructure supported processing larger files. There's also the option to bump up the memory on the ECS workers.
@aturner @christinklez Actually the legacy deep harvester did just grab the main content file. I'm guessing we had to use XL workers for processing this collection/object.
We haven't set up the rikolti harvester to be able to spin up mega-workers just for a specific collection. I guess we could temporarily reconfigure it for just this collection...
@aturner @christinklez It just occurred to me that you should be able to ETL this collection since there's only 1 simple object. It's the complex objects that the ETL mapper doesn't support.
@barbarahui -- Exciting turn of events! I'll test this out now!!!!
@barbarahui @aturner -- ETL works for Nuxeo simple objects: https://calisphere-stage.cdlib.org/item/83cbedca-e4cd-45ec-b2d1-490d6b9fd951/
However, it's not supported by a viewer, and includes a circular link back to the page we're on.
Given that this is a UCB Bancroft collection, and that UCB is working on moving these collections off Nuxeo to replace them with a TIND reharvest, I don't know if it's worth prioritizing this issue at this moment. (Unless we come across other collections with supersized images in Nuxeo.)
Ohh, it's because the frontend determines whether or not the item is hosted based on mapper type. I can update the mapper type in OpenSearch to be nuxeo.nuxeo for this one record. It's a bit hacky, and if you re-ETL this record it will break the viewer again, but I think it's an acceptable compromise for now?
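For the notes, a minimal sketch of what that one-record update could look like, using OpenSearch's partial-document update endpoint (`POST /<index>/_update/<id>`). The index name `example-items-index`, the field name `mapper_type`, and the use of the Calisphere item UUID as the OpenSearch document ID are all assumptions for illustration, not confirmed from the rikolti code. The sketch only builds the request and does not send it.

```python
import json


def build_mapper_type_update(index, record_id, mapper_type="nuxeo.nuxeo"):
    """Build (method, path, body) for an OpenSearch partial document update.

    The field name "mapper_type" is an assumption; check the real index
    mapping before using this against an actual cluster.
    """
    path = f"/{index}/_update/{record_id}"
    # A partial update only touches the fields listed under "doc".
    body = {"doc": {"mapper_type": mapper_type}}
    return "POST", path, json.dumps(body)


# Hypothetical index name; the document ID is assumed to match the item UUID.
method, path, body = build_mapper_type_update(
    "example-items-index",
    "83cbedca-e4cd-45ec-b2d1-490d6b9fd951",
)
print(method, path)
print(body)
```

Because the change lives only in the index, re-running ETL for the record would overwrite it, which matches the caveat above.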
@barbarahui I think that sounds acceptable for this record--I'll record this in our notes! @aturner, what do you think??
This sounds good to me -- as a back-pocket strategy for ETL'ing simple Nuxeo objects into the new index, for the time being!
Sadly this doesn't work. I forgot that since we ETLed the collection, the content harvester doesn't create the jp2 and so there isn't anything to display in the image viewer 👎
I vote for leaving this as is for now, if that's acceptable? As you say above, it's probably not worth doing the work to accommodate this object unless we run into a lot more supersized content files?
I agree that we can move forward with this particular record as is. It's a quirky frontend experience, but it's not completely broken.
Let's consider this a low priority issue for now, and revisit if it happens to come up for other Nuxeo collections. Thank you for looking into this, Barbara!!
Mapper: Nuxeo
Collection ID: 26973
Run ID: manual2024-04-24T00:05:08+00:00
Permalink to the log: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/log?dag_id=harvest_collection&task_id=content_harvesting.content_harvest&execution_date=2024-04-24T00%3A05%3A08%2B00%3A00&map_index=0
Link to the gridview: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/harvest_collection/grid?dag_run_id=manual2024-04-24T00%3A05%3A08%2B00%3A00&task_id=content_harvesting.content_harvest&tab=mapped_tasks&map_index=0&num_runs=365