Closed barbarahui closed 1 month ago
Hey @barbarahui - thanks, the deletion code looks good. I think we actually still want these subprocess conversion errors to be non-blocking to the rest of the job, though - could you please revert the second of these two commits?
@amywieliczka Do we want to add some code to handle the error somewhere? At the moment no error is raised and the job succeeds even if a subprocess throws an error.
I'll go ahead and revert the second commit so we can get the file deletion code through and add an issue for the subprocess error handling.
@barbarahui I know, but that's the intended behavior at the request of the rest of the team - if you look at the blame on that line that returns None rather than raising an error, it's about making it non-blocking. The job should succeed even if the subprocess fails. A message should get printed to the logs, though.
In the Calisphere UI, iirc, this results in an icon being used instead of the thumbnail, which is another way it should get caught during QA. Once we're on Airflow 2.9 it should be a bit easier to find which mapped Airflow task ran which page, and trace back to the logs to understand why there's no thumbnail.
I did try to implement some rollup reporting, but due to the iffy-at-best return values from a RunTask operator, found it prohibitively difficult to do so. I think there's an outstanding issue about it somewhere. I think it bears thinking about in a larger context rather than shoehorning into this particular bug fix.
@amywieliczka Ahhh gotcha. I kind of remember this now but I think I might've been off doing aspace or nuxeo things. Thanks for explaining!!
The disk is filling up during content harvesting. I think the file removal statements in this bit of code got dropped during the recent caching implementation: https://github.com/ucldc/rikolti/blob/9f7be231bd24dbb8b5e34e8b1c7554e59e4e5ada/content_harvester/by_record.py#L149-L156
The code in this PR removes source files and derivative files from local disk once they're no longer needed. I tried to find a neater way to remove the files all at once, i.e. at the end of
harvest_record_content
, but it proved tricky.I also made a change to
derivatives.subprocess_exception_handler
so that it raises an error instead of returning None if there is an error. As was, the DAG was succeeding even when, for example, thepdf_to_thumb
subprocess was failing. It seemed like the return value ofNone
was not meant to be permanent, given the fact thatraise(e)
was commented out?