We have a helper delete_callback at https://github.com/openzim/python-scraperlib/blob/335d5271e106b374f1aca871d19557ff2c81582d/src/zimscraperlib/filesystem.py#L47

This delete_callback is meant to be used as a callback when adding an item to the ZIM, typically to delete the original file once it has been added to the ZIM.
There are some edge cases where the file to delete might already be gone, typically when the scraper encounters an exception and decides to stop ZIM creation and delete all temporary files on exit. Because these operations run concurrently, the exit-time cleanup of temporary files might complete before delete_callback is invoked.
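For illustration, here is a minimal reproduction of the failure mode, assuming delete_callback boils down to an unconditional unlink of the given path (the actual helper may do more):

```python
import pathlib
import tempfile

# Minimal reproduction: assuming delete_callback boils down to an
# unconditional unlink, a concurrent exit-time cleanup that removes
# the file first makes the later unlink raise FileNotFoundError.
with tempfile.TemporaryDirectory() as tmp:
    fpath = pathlib.Path(tmp) / "original.bin"
    fpath.write_bytes(b"payload")

    fpath.unlink()  # exit-time cleanup wins the race, file is gone

    try:
        fpath.unlink()  # delete_callback runs afterwards and crashes
    except FileNotFoundError:
        print("delete_callback would fail here: file already deleted")
```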
I think that we should enhance the delete_callback to either (both options are sketched below):
1. always silently ignore when the file to delete is already gone, or
2. provide a parameter to activate the silent handling of a missing file
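For the sake of discussion, a rough sketch of what the two options could look like; the parameter name ignore_missing and its default below are assumptions, not a decided API:

```python
import pathlib

# Option 1: always silently ignore a missing file; missing_ok=True
# (Python >= 3.8) swallows FileNotFoundError only, any other OSError
# still propagates.
def delete_callback(fpath: str | pathlib.Path) -> None:
    pathlib.Path(fpath).unlink(missing_ok=True)

# Option 2: make the silent handling opt-in through a parameter;
# the name and default here are assumptions, not a decided API.
def delete_callback_with_flag(
    fpath: str | pathlib.Path, *, ignore_missing: bool = False
) -> None:
    pathlib.Path(fpath).unlink(missing_ok=ignore_missing)
```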
Solution 2 would help in the sense that silently ignoring deletion issues could mask bugs in some scrapers, leading to huge disk space consumption for nothing. However, the edge cases mentioned above, where the file is already gone, are pretty rare and difficult to reproduce, so I'm more in favor of solution 1: I assume most scraper developers will never realize such situations might appear until hit by them (probably in production).
This issue was originally uncovered by @dan-niles in the Youtube scraper, kudos!