learningequality / ricecooker

Python library for creating Kolibri channels and uploading to Studio
https://ricecooker.readthedocs.io/
MIT License
18 stars 53 forks source link

Ricecooker does not clean up temp files and directories #187

Open kollivier opened 6 years ago

kollivier commented 6 years ago

Description

Forgot to file this as an issue when I ran into it, and just made a mental note to look into it later. Basically, ricecooker (and hence chefs by default) do not clean up temp files and directories, which can cause them to bloat considerably over time.

In particular, create_predictable_zip creates a zip file in the temp folder that is never deleted. What makes this particularly problematic is that these temp files remain until the OS cleans them up, which on Mac appears to be upon reboot. While debugging a PraDigi issue, I ended up filling up my HDD space after a couple runs because each run would create several GB of temp files and leave them on my drive.

The simplest solution is to have ricecooker create a root temp directory on start, and provide access to this directory to chefs (e.g. ricecooker.get_temp_dir()) so that they can store all the temp files they create inside it. Then, even if the chef errors out, we wipe out this temp directory before exiting. (Maybe an @atexit handler?)

What I Did

Ran the PraDigi chef locally on OS X then inspect my temp files directory.

ivanistheone commented 5 years ago

Yeah this would be good to fix. There are two cases:

  1. "one-off" temp files that we don't want to keep [especially ones auto-generated by ricecooker utils]
  2. files created in pre_run that we do want to keep---e.g. intermediate states for files we download and process locally before uploading. These shouldn't be deleted atexit, but should be deleted if --update directive is given. Currently this --update logic has to be done by chef.

Related to 2., I opened this to maybe turn it into a general purpose thing: https://github.com/learningequality/ricecooker/issues/138

And something to watch out for RE --update logic, many of the chefs use "forever caching" like this: https://github.com/learningequality/sushi-chef-pradigi/blob/master/chef.py#L69-L78 This is very useful when debugging, but we need to make chefs in production have a more reasonable expiry logic—otherwise we'll never re-scrape sites.

ivanistheone commented 4 years ago

Here is some recent chef code that sets the TMPDIR env variable which tmpfile package respcts, and as a result all the chef's temporary files are put into chefdata/tmp.

https://github.com/learningequality/sushi-chef-african-storybook/blob/master/chef.py#L31-L34

I don't see any blockers for this being the default behaviour for ALL chefs (implemented in ricecooker), possibly with a way to override somehow if chef author needs tmp files to go elsewhere...

The location chefdata/tmp seems like a good fit but open for other names?

Note also there are several places in the ricecooker code that don't delete tmp files, so now would be a good time to fixe those with finally blocks or other cleanups.