rwth-i6 / i6_core

Sisyphus recipes for ASR
Mozilla Public License 2.0

Proposing a CleanerJob to delete unused data #97

Closed npwynands closed 1 year ago

npwynands commented 3 years ago

Hey, I have an idea for a Job supplementing this recipe collection. I've been talking about this with my supervisor, Willi, and he told me to discuss it here because you might also be interested in it. Please note that I am actually working on MT, not ASR. Further, I've been using an MT version of this recipe repo, which was apparently forked many years ago. So, I might not be that familiar with the current mechanics of this code base and with ASR methodology. Anyway, my idea is independent of these issues.

My motivation is that the checkpoints of my models always consume a large part of my hard drive's storage space. Often, I train dozens of models in parallel, which makes my storage reach its capacity rapidly. The thing is, I don't need all of my models' checkpoints. At any time, I only want to keep the "best" checkpoint of each model, i.e., the checkpoint with the best score (e.g. BLEU, TER, WER, whatever you like). This means that whenever a checkpoint is the best so far, I can delete all preceding ones, and if it isn't, I can delete that checkpoint itself right away. So, I've been thinking of an algorithm which automates this very procedure for me.

My idea is to implement this by appending a Job, let's call it CleanerJob, at the end of each checkpoint evaluation pipeline. Each CleanerJob is given a checkpoint, the checkpoint's score, and the CleanerJob of the preceding checkpoint. Given the score, the CleanerJob can assess whether its checkpoint is the best compared to those of its predecessors so far, and hence whether to delete its own checkpoint or all of its predecessors' checkpoints. Besides the checkpoint, we might also tell the CleanerJob to delete the evaluation pipeline, since this becomes dead data too. The concept is illustrated in this figure:

[Figure: illustration of the proposed chain of CleanerJobs attached to each checkpoint's evaluation pipeline]

This was my initial idea. However, I would like to design the CleanerJob to be as generic as possible. First of all, it must be possible to use any metric for the score, but then, why not provide the possibility to set custom delete conditions in the first place? Also, as mentioned earlier, the data to delete should be customizable. Further, the CleanerJob need not be restricted to deleting checkpoints, of course; it could be designed to delete any dead data.
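To make this a bit more concrete, here is a minimal sketch of how such a chained CleanerJob could look. Everything about the job itself is hypothetical (the class, its output variables, and the way the predecessor's outputs are passed in do not exist in i6_core; only the Sisyphus Job/Task/output_var API is real), and I assume a WER/TER-like score where lower is better:

import os

from sisyphus import Job, Task


class CleanerJob(Job):
    """Hypothetical sketch, not part of i6_core: keeps only the best checkpoint seen so far."""

    def __init__(self, checkpoint, score, prev_best_score=None, prev_best_checkpoint=None):
        self.checkpoint = checkpoint                    # tk.Path to this epoch's checkpoint
        self.score = score                              # tk.Variable with the external score (lower = better assumed)
        self.prev_best_score = prev_best_score          # outputs of the preceding CleanerJob, or None for the first one
        self.prev_best_checkpoint = prev_best_checkpoint

        self.out_best_score = self.output_var("best_score")
        self.out_best_checkpoint = self.output_var("best_checkpoint")

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        score = self.score.get()
        if self.prev_best_score is None or score < self.prev_best_score.get():
            # new best: drop the previously kept checkpoint and remember this one
            if self.prev_best_checkpoint is not None:
                os.unlink(self.prev_best_checkpoint.get())
            self.out_best_score.set(score)
            self.out_best_checkpoint.set(self.checkpoint.get_path())
        else:
            # worse than the best so far: delete this checkpoint, carry the previous best forward
            os.unlink(self.checkpoint.get_path())
            self.out_best_score.set(self.prev_best_score.get())
            self.out_best_checkpoint.set(self.prev_best_checkpoint.get())

Chaining would then be something like CleanerJob(ckpt, score, prev.out_best_score, prev.out_best_checkpoint) per evaluated checkpoint. Note that a real RETURNN checkpoint consists of several files, so the os.unlink calls are a simplification.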

I have lots of ideas for what could be done with this. But before I start implementing it, I would like to hear whether there are any comments, wishes, supplementary ideas, etc. from your side. Maybe you have special requirements I need to consider, maybe something ASR-related which I might not be aware of? I would appreciate any feedback!

JackTemaki commented 3 years ago

Hey @npwynands, there are already existing solutions for your problem, which might not do exactly what you want, but at least do similar things.

Variant 1 - use the cleanup function from RETURNN itself: with cleanup_old_models=True, RETURNN will (by default) keep the best 4 checkpoints and the last 4 checkpoints based on the default dev_score that is also used for the learning rate reduction (a config sketch follows after this list of variants).

Variant 2 - use GetBestEpochJob: unfortunately this job is not yet part of the official i6_core, but it takes a finished model folder (no matter if it already uses internal cleanup or not) and can give you the "n-th" best model based on the key you give. It can be set to either symlink or copy the checkpoint, so in case you have 4 jobs for the 1st to 4th best model, you could then delete all models in your training folder.

Variant 3 - custom scripts: some people implemented their own scripts that do cleaning independently of Sisyphus

Variant 4 - yours: here I think it is difficult to make it generic enough, because it sounds like a rather specific problem with respect to RETURNN checkpoints. But what actually exists (not pushed yet) is a MultiCleanupJob which basically deletes jobs when some given job is finished. I used this for pipelines like "Synthesize Features -> Convert to Audio -> create bliss -> create ogg-zip", and then deleted everything except the ogg-zip automatically. But this does not help here, of course, because it is important to keep the training folder to look at the config. There is definitely no simple solution for your problem, and in any case the Sisyphus job dependencies have to be managed carefully, so that you do not delete checkpoints too early.
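For Variant 1, the corresponding entry in the RETURNN config would look roughly like this (written from memory, the exact keys should be double-checked against the RETURNN documentation):

# Sketch of the relevant RETURNN config entries (a RETURNN config is Python code).
cleanup_old_models = True  # keep the best 4 and the last 4 checkpoints (default behavior)

# alternatively, more fine-grained (keys from memory):
# cleanup_old_models = {
#     "keep_best_n": 4,   # best checkpoints according to the internal dev_score
#     "keep_last_n": 4,   # most recent checkpoints
#     "keep": [40, 80],   # specific epochs to always keep
# }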

michelwi commented 3 years ago

The problem is that this kind of job violates a basic assumption of Sisyphus: the graph, once defined, is static, and the outputs of every finished job exist. Within the recipes we can only try to trick the mechanics of Sisyphus into running a Job that deletes other jobs' outputs within its run method.

Some sketch of my idea:

import os

from sisyphus import Job, Task, tk


class ConditionedDeleteJob(Job):
    def __init__(self, files_to_delete, condition, unused_inputs=None):
        """
        :param list[Path] files_to_delete: outputs to be deleted
        :param tk.delayedBool condition: condition under which files are to be deleted
        :param list[Path] unused_inputs: additional inputs to delay execution of this job until other jobs are finished
        """
        self.files_to_delete = files_to_delete
        self.condition = condition
        self.unused_inputs = unused_inputs

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        if self.condition.get():
            for path in self.files_to_delete:
                os.unlink(tk.uncached_path(path))

then we could do something like ConditionedDeleteJob(lattice_caches, True, ScoringJob.out_reports) (i.e. always delete lattices when the scoring is finished) or ConditionedDeleteJob(model_path, (WER > current_best_WER)) (I admit there is still a lot to figure out in my examples^^)
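To make the second example a bit more concrete: as far as I know there is no tk.delayedBool in Sisyphus yet, so the WER condition would need a small delayed wrapper whose get() is only called inside the run method. Something along these lines, where DelayedGreaterThan and all output/variable names (scoring_job.out_wer, current_best_wer, model_path, out_reports) are made up for illustration, and I assume the DelayedBase interface from sisyphus.delayed_ops:

from sisyphus.delayed_ops import DelayedBase


class DelayedGreaterThan(DelayedBase):
    """Hypothetical delayed comparison; evaluated only when the consuming job runs."""

    def __init__(self, a, b):
        super().__init__(a, b)  # assumption: DelayedBase stores a and b

    def get(self):
        a = self.a.get() if hasattr(self.a, "get") else self.a
        b = self.b.get() if hasattr(self.b, "get") else self.b
        return a > b


# delete the checkpoint if its WER is worse than the best WER seen so far
# (scoring_job, out_wer, current_best_wer, model_path, out_reports are hypothetical placeholders)
delete_job = ConditionedDeleteJob(
    files_to_delete=[model_path],
    condition=DelayedGreaterThan(scoring_job.out_wer, current_best_wer),
    unused_inputs=[scoring_job.out_reports],
)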

npwynands commented 3 years ago

Hey @JackTemaki thank you for the quick reply!

Yes, I'm aware that RETURNN provides these checkpoint cleaning mechanics. However, as you mentioned, it uses the RETURNN-internal dev_score. As far as I know, it is not possible to make RETURNN consider external scores for the checkpoint selection. Further, if I enabled RETURNN's cleanup mechanics in addition, there would be a danger that RETURNN deletes a checkpoint which turns out to be the best according to the external metric I use. There is also the possibility that RETURNN deletes checkpoints before they can be evaluated by my evaluation pipeline, e.g. if those Jobs are dispatched late.

As you said, variant 2 operates on finished model folders. In my case, I have multiple models whose checkpoints exceed the capacity of my hard drive before training can finish. Thus, variant 2 would not help me here; I need a solution which deletes checkpoints during training, hence my idea to insert CleanerJob nodes into the Sisyphus graph.

Yeah, I have also written various cleaner scripts myself, which operate independently of Sisyphus. But those were always very use-case-specific. I believe a CleanerJob acting as a cleanup step/node within the Sisyphus graph could be a more general solution here.

My problem might sound RETURNN-specific because it is my starting point, but I actually have a more general view on this. For instance, your example of deleting dead data generated during pre-processing is an issue which I imagine can also be solved by my concept. In this regard, it sounds like the MultiCleanupJob you are talking about goes in the direction I'm heading. May I look into this? You could send me a link via Slack. And yeah, not messing up the Sisyphus Job dependencies is another issue I have to take care of. I'll keep an eye on this.

michelwi commented 3 years ago

May I add another variant to the list of @JackTemaki:

Variant 5 - use the Sisyphus cleaner: There is (was?) a tk.cleaner in Sisyphus that can delete finished Jobs based on a keep_value. It has to be run manually (in the console?) and is restricted to finished Jobs, but the keep_value can be set conditionally (e.g. on the Scorer.wer()) in the recipes.

curufinwe commented 3 years ago

I think solving this in the recipes is the incorrect approach. Jobs should be standalone and only operate within their own folder. Otherwise it's hard to know in what state the graph currently is (Sisyphus thinks the job is done, but actually the output is missing). In my view, the more correct approach here is to include this functionality in Sisyphus itself.

Let's start on the Job level, i.e. cleaning up jobs that are no longer needed. I would suggest adding a flag to the job (e.g. _sis_delete_after_use). Sisyphus would then delete that whole job once all jobs that depend on it are done.

If deletion is too much and only some outputs should go away, we could add a new state: "compacted". Each job has to define for itself what it means to compact it (usually something like deleting all large artifacts but leaving log files alone). If for some reason a new job appears that wants the output of a compacted job, Sisyphus could ask the user what to do: whether the job should be deleted and rerun, or whether to stop execution to give the user the chance to inspect the situation manually.

As these two behaviors are mutually exclusive, the flag to control this should be something like _sis_on_dependencies_finished with possible values in {keep, delete, compact}.
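As a rough illustration of what this could look like from a job's perspective (nothing of this exists in Sisyphus or i6_core yet; the flag, the hook, and the attribute names are purely a proposal sketch):

import os

from sisyphus import Job


class SomeLargeJob(Job):
    # hypothetical flag, evaluated by Sisyphus once all jobs depending on this one are finished;
    # possible values: "keep" (current behavior), "delete" (remove the whole job), "compact"
    _sis_on_dependencies_finished = "compact"

    def compact(self):
        # hypothetical hook: each job defines for itself what compacting means,
        # typically deleting its large artifacts while keeping logs and configs
        for path in getattr(self, "large_outputs", []):  # e.g. a list of tk.Path set in __init__
            if os.path.exists(path.get_path()):
                os.unlink(path.get_path())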

@critias what do you think?

curufinwe commented 3 years ago

P.S. this compacting behavior could also be used to remove all but the best-n checkpoints (but only after all recognition jobs have finished).

michelwi commented 3 years ago

P.S. this compacting behavior could also be used to remove all but the best-n checkpoints (but only after all recognition jobs have finished).

but only if the criterion is defined in the Job itself (e.g. CE loss of checkpoint). If we want to remove based on WER, then we depend on the output of the ScorerJob.

JackTemaki commented 3 years ago

but only if the criterion is defined in the Job itself (e.g. CE loss of checkpoint). If we want to remove based on WER, then we depend on the output of the ScorerJob.

This is why I like the idea of having an extra job that copies the wanted checkpoints. Then the "compact" job could still only do internal cleanup, but some additional checkpoints would be saved in another job.

Edit: This would even allow us to "wipe" all training jobs, and the best checkpoints would still survive.
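A rough sketch of such a copy job, in the spirit of the GetBestEpochJob mentioned above (class, argument, and output names are made up; a lower score is assumed to be better, and a real RETURNN checkpoint would consist of several files that all need copying):

import shutil

from sisyphus import Job, Task


class CopyBestCheckpointsJob(Job):
    """Hypothetical sketch: copy the n best checkpoints into this job's own output folder,
    so that the training job itself can later be compacted or wiped."""

    def __init__(self, checkpoints, scores, n=4):
        self.checkpoints = checkpoints  # dict: epoch -> tk.Path of the checkpoint
        self.scores = scores            # dict: epoch -> tk.Variable with e.g. the WER of that epoch
        self.n = n
        self.out_checkpoints = {i: self.output_path("best.%03d" % i) for i in range(1, n + 1)}

    def tasks(self):
        yield Task("run", mini_task=True)

    def run(self):
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1].get())  # lower score = better assumed
        for rank, (epoch, _) in enumerate(ranked[: self.n], start=1):
            shutil.copy(self.checkpoints[epoch].get_path(), self.out_checkpoints[rank].get_path())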

michelwi commented 3 years ago

Well, then the next thing on my bucket list is a continuable ReturnnTrainingJob, which should be able to transition from the state compacted (but with the last epoch kept) to runnable once a later epoch is requested.

How about starting a separate ReturnnTrainingJob per epoch? Then it would be trivial to add additional epochs later, and each epoch's Job would be finished once the epoch is done. Some problems would be that 1) it is annoying to schedule on the cluster and 2) the optimizer state becomes inconsistent.

JackTemaki commented 3 years ago

I think we had continuable jobs in the past (for MT)... But at least I always start a new job when I want to continue. And resetting the optimizer might actually help rather than hurt performance...

curufinwe commented 3 years ago

Let's start with the Sisyphus modifications first and worry about the training later. The current job is continuable in the sense that you can manually transition it out of the finished state and start it again. But having that in Sisyphus would also be nice.

critias commented 3 years ago

Sorry for the late response. I missed this discussion.

So by now, there are multiple things going on here:

  1. Continued neural network training: When you create a task, you can set continuable = True. That task will never be marked as finished and will be resubmitted if an output path is requested that hasn't been computed yet. It isn't working as stably as I would like, but it works (a minimal example follows after this list).

  2. I like the idea of a delete_after_use marker for jobs and don't see any problems adding one.

  3. About the main question: Removing only some outputs from a Sisyphus job can be problematic, since one of the basic assumptions of Sisyphus is that all outputs are available once a Job is finished. Nevertheless, we are currently misusing the register_report function to create an overview of the job progress and clean up weaker models at the same time. It hasn't caused any problems for us so far, but it could if one of these now-missing models were requested. Right now, we keep the best 20 models by default and make sure to only remove models once all translations are done. The approach by @JackTemaki, having a separate job that the X best models can be moved to, is the cleanest version, but it has the downside that the models can only be removed once the whole training is completed, and continuing the training in the old folder isn't possible anymore. The dependency between jobs and paths is already loosened a bit, since a path can be marked as finished while the job is still running, but it's not trivial to tell whether a requested path is still missing or already deleted. A way out could be to keep track of which outputs were deleted and let Sisyphus complain if one of them is requested. I'll have to think about this problem a bit to see if I can suggest a better solution.
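A minimal example for point 1, just to show where the flag goes (the surrounding job is made up):

from sisyphus import Job, Task


class MyTrainingJob(Job):
    # made-up job, only to show where continuable is set
    def tasks(self):
        # this task is never marked as finished and gets resubmitted whenever
        # an output path is requested that hasn't been computed yet
        yield Task("run", continuable=True)

    def run(self):
        pass  # the actual (continued) training would go here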

JackTemaki commented 1 year ago

Closed due to inactivity; it seems there is currently no urgent need for such a job.