openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
19 stars 16 forks source link

delete_callback should support to ignore silently the fact that fpath is missing #192

Open benoit74 opened 2 months ago

benoit74 commented 2 months ago

We have a helper delete_callback at https://github.com/openzim/python-scraperlib/blob/335d5271e106b374f1aca871d19557ff2c81582d/src/zimscraperlib/filesystem.py#L47

This delete_callback is meant to be used as a callback when adding an item to the ZIM, typically to delete original file once it has been added to the ZIM.

There are some edge cases where the file to delete might already be gone, typically when the scraper encounters an Exception and decides to stop ZIM creation and delete all temporary files on exit. Due to the concurrency of these operations, the deletion of temporary files on exit might complete before the delete_callback is invoked.

I think that we should enhance the delete_callback to either:

  1. always silently ignore when the file to delete is already gone
  2. provide a parameter to activate the silent ignore of missing file

Solution 2 would help in the sense that some scrapers might have silent bugs if we silently ignore deletion issues, leading to huge disk space consumption for nothing. However the edge cases causing the issue mentioned where file is already gone are pretty rare and difficult to produce, so I'm more in favor of solution 1 since I would assume most scraper developers will never realize such situations might appear until hit by them (probably in production).

This issue has been originally unveiled by @dan-niles in Youtube scraper, kudos.

rgaudin commented 2 months ago

I'm in line with everything you wrote. A warning in the logs would be enough at first