ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License
7 stars 3 forks source link

Content harvest rm tmp files #1124

Closed barbarahui closed 1 month ago

barbarahui commented 1 month ago

The disk is filling up during content harvesting. I think the file removal statements in this bit of code got dropped during the recent caching implementation: https://github.com/ucldc/rikolti/blob/9f7be231bd24dbb8b5e34e8b1c7554e59e4e5ada/content_harvester/by_record.py#L149-L156

The code in this PR removes source files and derivative files from local disk once they're no longer needed. I tried to find a neater way to remove the files all at once, i.e. at the end of harvest_record_content, but it proved tricky.

I also made a change to derivatives.subprocess_exception_handler so that it raises an error instead of returning None if there is an error. As was, the DAG was succeeding even when, for example, the pdf_to_thumb subprocess was failing. It seemed like the return value of None was not meant to be permanent, given the fact that raise(e) was commented out?

amywieliczka commented 1 month ago

Hey @barbarahui - thanks, the deletion code looks good. I think we actually still want these subprocess conversion errors to be non-blocking to the rest of the job, though - could you please revert the second of these two commits?

barbarahui commented 1 month ago

@amywieliczka Do we want to add some code to handle the error somewhere? At the moment no error is raised and the job succeeds even if a subprocess throws an error.

I'll go ahead and revert the second commit so we can get the file deletion code through and add an issue for the subprocess error handling.

amywieliczka commented 1 month ago

@barbarahui I know, but that's the intended behavior at the request of the rest of the team - if you look at the blame on that line that returns None rather than raising an error, it's about making it non-blocking. The job should succeed even if the subprocess fails. A message should get printed to the logs, though.

In the Calisphere UI, iirc, this results in an icon being used instead of the thumbnail, which is another way it should get caught during QA. Once we're on Airflow 2.9 it should be a bit easier to find which mapped Airflow task ran which page, and trace back to the logs to understand why there's no thumbnail.

I did try to implement some rollup reporting, but due to the iffy-at-best return values from a RunTask operator, found it prohibitively difficult to do so. I think there's an outstanding issue about it somewhere. I think it bears thinking about in a larger context rather than shoehorning into this particular bug fix.

amywieliczka commented 1 month ago

Issue: https://github.com/ucldc/rikolti/issues/969

barbarahui commented 1 month ago

@amywieliczka Ahhh gotcha. I kind of remember this now but I think I might've been off doing aspace or nuxeo things. Thanks for explaining!!