neurophysik / jitcdde

Just-in-time compilation for delay differential equations

Temp files sometimes aren't deleted #42

Closed mwappner closed 2 years ago

mwappner commented 2 years ago

A heads up: the following is not a well-written issue report, since I was able neither to isolate the cause nor to provide a minimal example that reproduces the issue.

Part of the use I give to this library is integrating multiple thousands of systems with slightly different parameter values to draw maps in the parameter space (think bifurcation diagrams, for example). This requires that I run the integrator in parallel using the built-in multiprocessing library.

The issue: under some circumstances (I'm not sure which), jitcdde fails to delete the temp files it creates in /tmp, sometimes leaving thousands of them behind. I've noticed this in at least two different cases so far:

  1. when Ctrl+C-ing to interrupt the integration, sometimes only some of the temp files are left behind
  2. when running a large-ish set of parameters (say 30k), eventually execution grinds to a halt and I get the error message -bash: cannot create temp file for here-document: No space left on device, which is supposedly related to either a full disk or a lack of available inodes, neither of which is the case. I do, however, have around 2k temp files in /tmp created by jitcdde after having run around 6k integrations, so some were deleted.

I don't know whether, if the integration somehow fails because the system is ill-determined for a specific set of parameter values, the temp files won't be deleted.

I know this is very little info to go on, but maybe there's a known issue related to this that can help. If it helps: I only started noticing this after I began using helpers in my integrations. If there's any other info I can provide that may be of use, just let me know.

jitcdde version: 1.8.1
OS: openSUSE Leap 15

Wrzlprmft commented 2 years ago

Some quick thoughts:

In general, temp files are deleted together with the respective JiTCDDE object. If Python’s garbage collector fails to recognise a proper time to do this, you can trigger it explicitly using del, i.e.:

DDE = jitcdde(…)
…
del DDE

PS: Use __del__ instead. See below.

when Ctrl+C-ing to interrupt the integration, sometimes only some of the temp files are left behind

This is more or less to be expected. Cleaning up temp files happens in Python, and when you kill Python, it cannot clean up after itself.

I don't know whether, if the integration somehow fails because the system is ill-determined for a specific set of parameter values, the temp files won't be deleted.

Can you catch that exception and clean up with del?
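That pattern can be sketched as follows. FakeDDE is a hypothetical stand-in (not jitcdde's actual class) whose __del__ removes an owned temp directory, which is what deleting the real object is supposed to do:

```python
import gc
import os
import shutil
import tempfile

class FakeDDE:
    """Hypothetical stand-in for a jitcdde object: it owns a temporary
    directory and removes it in __del__, as the real class does for its
    generated C sources."""
    def __init__(self):
        self.tmpdir = tempfile.mkdtemp(prefix="jitcdde_demo_")
    def integrate(self):
        raise RuntimeError("ill-determined system")
    def __del__(self):
        shutil.rmtree(self.tmpdir, ignore_errors=True)

DDE = FakeDDE()
saved = DDE.tmpdir
try:
    DDE.integrate()
except RuntimeError:
    pass          # handle / log the failing parameter set here
finally:
    del DDE       # drop the reference whether or not integration failed
    gc.collect()  # encourage immediate collection

print(os.path.exists(saved))  # False: the temp dir was removed
```

The finally-block ensures the reference is dropped on both the success and the failure path, so the temp directory does not outlive a failed integration.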

mwappner commented 2 years ago

I manually cleaned the temp directory (to delete all the files that weren't deleted after Ctrl+C-ing) and ran a new simulation while keeping an eye on the temp files, which numbered around 7k to 11k while running on 100 cores. At some point close to the end of the sim I got an error message

Cannot create temporary file in /tmp/: No space left on device
error: command 'gcc' terminated by signal 6

which I didn't get the earlier times. Inspecting the tmp directory, I can see that some of the leftover files are quite new, but some are from the very start of the sim, as if they had failed to be deleted. From the contents of the oldest directory that wasn't deleted, I can extract the parameter values that caused the issue from f_definitions.c and run that same set of parameters separately. The punchline: it runs without issue, so I don't know why that directory failed to be deleted.

Is there a case where the garbage collector may fail to... well collect the garbage? I'll try adding a del DDE to my code and report back with the results.

Thanks for the help :)

Wrzlprmft commented 2 years ago
Cannot create temporary file in /tmp/: No space left on device
error: command 'gcc' terminated by signal 6

which I didn't get the earlier times.

Well, if you run out of space, it is more or less random which of the many processes that need it is the final straw. Sometimes it’s creating a temporary directory; sometimes it’s creating the source code file in that directory; and sometimes it’s the compiled file.

Is there a case where the garbage collector may fail to... well collect the garbage?

I am far from an expert on how Python’s garbage collector works, but it’s certainly not perfect and to some extent heuristic, because it cannot be anything else. Moreover, mind that Python’s garbage collector mostly cares about the working memory consumed by Python, not about temporary directories, which are a special need of JiTC*DE for the just-in-time compilation. So, if your RAM is free, Python’s garbage collector may not see the need to act.

I'll try adding a del DDE to my code and report back with the results.

Yes, and with catching exceptions, you should be able to delete any temporary directory once the simulation is over, whether successful or not.

However, I just looked into better ways to enforce the deletion of temporary directories using atexit and implemented them. Mind that they will only help if the respective Python instance closes (which, however, seems to apply to you). I can only test them insofar as they do not appear to interfere with regular usage, so it would be great if you could check whether they actually improve things. To do so, please install the latest version of jitcxde_common from GitHub using something like:

pip3 install git+https://github.com/neurophysik/jitcxde_common
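
For reference, the atexit-based cleanup pattern can be demonstrated with the standard library alone. This is only a sketch mimicking the idea in spirit, not jitcxde_common's actual code; it spawns a child interpreter so the atexit handler actually gets to run:

```python
import os
import subprocess
import sys

# Child script: create a temp dir and register its deletion with atexit,
# so it is removed when the interpreter exits normally.
child = """
import atexit, shutil, tempfile
d = tempfile.mkdtemp(prefix="jitcxde_demo_")
atexit.register(shutil.rmtree, d, ignore_errors=True)
print(d)
"""

out = subprocess.run([sys.executable, "-c", child],
                     capture_output=True, text=True, check=True)
tmpdir = out.stdout.strip()
print(os.path.exists(tmpdir))  # False: removed when the child exited
```

Note that this only helps on a regular interpreter exit; a hard-killed process never runs its atexit handlers.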
mwappner commented 2 years ago

Well, I used the new jitcxde_common version and it had a very strange effect: it deleted no files in the temp dir (instead of just some of them, as used to happen), but it also... worked? As in, the simulation didn't stall and finished successfully. But all of the few thousand temp dirs are still there.

To investigate this, I read the tempfile and atexit module documentation and was wondering if in this line you are using mkdtemp instead of using TemporaryDirectory for some specific reason. As far as I understand, the latter should handle the cleanup on its own once the jitcxde object is deleted. Not sure if it would be an improvement (you can always call tempdir.cleanup() instead of using shutil to handle it). If I implement this change, I'll report how it works.
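For context, the difference between the two tempfile APIs boils down to who is responsible for deletion (the prefix below is just illustrative):

```python
import os
import shutil
import tempfile

# mkdtemp: the caller is fully responsible for deletion
legacy = tempfile.mkdtemp(prefix="jitcxde_")
shutil.rmtree(legacy)  # must be called explicitly, e.g. from __del__

# TemporaryDirectory: deletion happens via .cleanup(), when the object
# is garbage-collected, or when it is used as a context manager
with tempfile.TemporaryDirectory(prefix="jitcxde_") as d:
    assert os.path.isdir(d)
print(os.path.isdir(d))  # False: removed on leaving the with-block
```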

Regarding my specific problem, what I did is change the target of the temp files to a new tempfile in the current directory like so:

import pathlib
import tempfile

tempdir = tempfile.TemporaryDirectory(dir=pathlib.Path(__file__).parent, prefix='jitcdde_tmp_')
tempfile.tempdir = tempdir.name

and then launch the processes. In a small scale test, that seemed to work, but I'll try a bigger scale one overnight and report back.

On a separate note, temp files were sometimes left in the working directory instead of in /tmp, which, as it turns out, makes sense if the OS for some reason wasn't able to use /tmp (out of memory, out of storage, or whatever). Here's a quote from the tempfile documentation regarding that:

Python searches a standard list of directories to find one which the calling user can create files in. The list is:

  1. The directory named by the TMPDIR environment variable.
  2. The directory named by the TEMP environment variable.
  3. The directory named by the TMP environment variable.
  4. A platform-specific location: On Windows, the directories C:\TEMP, C:\TMP, \TEMP, and \TMP, in that order. On all other platforms, the directories /tmp, /var/tmp, and /usr/tmp, in that order.
  5. As a last resort, the current working directory.

So, as per point 5, that mystery might be solved.
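
That search list can be observed directly; a small sketch (mind that tempfile caches its result, so the cache has to be discarded to re-scan the list):

```python
import os
import shutil
import tempfile

custom = tempfile.mkdtemp()      # a directory we control
old = os.environ.get("TMPDIR")
os.environ["TMPDIR"] = custom    # step 1 of the search list above
tempfile.tempdir = None          # discard the cached result so the list is re-scanned
resolved = tempfile.gettempdir()
print(resolved == custom)        # True

# restore the original state
if old is None:
    os.environ.pop("TMPDIR", None)
else:
    os.environ["TMPDIR"] = old
tempfile.tempdir = None
shutil.rmtree(custom)
```

Setting TMPDIR (or tempfile.tempdir, as in the snippet above) before launching the worker processes is therefore a clean way to redirect all of jitcdde's temp files to a place of your choosing.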

Wrzlprmft commented 2 years ago

Well, I used the new jitcxde_common version and it had a very strange effect: it deleted no files in the temp dir (instead of just some of them, as used to happen), but it also... worked? As in, the simulation didn't stall and finished successfully. But all of the few thousand temp dirs are still there.

I informed myself a bit more about Python’s garbage collector and the gist of it is this:

As a result, temporary directories almost unavoidably stick around a bit longer than they are actually needed unless you call __del__.

I […] was wondering if in this line you are using mkdtemp instead of using TemporaryDirectory for some specific reason.

I honestly cannot remember. I implemented this at the very beginning of the project; it worked (until now); I never touched it again. Either it stems from tutorials or SO answers that were outdated even then, or from maintaining Python 2 compatibility (which I dropped long ago). Anyway, I have now switched to TemporaryDirectory and it doesn’t cause any problems. All of this happened in jitcxde_common, so that’s what you would need to update.

However, the problems you encountered may not be related to this, and the best solution for you may still be to call DDE.__del__ or similar. On the other hand, TemporaryDirectory is better at cleaning up after itself, so that alone may already suffice for you.

Wrzlprmft commented 2 years ago

Also note that calling gc.collect() after del DDE may cause Python’s garbage collector to act sooner.
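
The reason gc.collect() can matter is reference cycles: plain reference counting never frees them, so an object caught in one lingers until the cyclic collector runs. A minimal stdlib demonstration:

```python
import gc

deleted = []

class Node:
    def __del__(self):
        deleted.append(True)

# build a reference cycle: reference counting alone can never free it
a, b = Node(), Node()
a.other, b.other = b, a
del a, b
before = bool(deleted)  # still False: the cycle is uncollected...
gc.collect()            # ...until the cyclic collector runs
after = bool(deleted)
print(before, after)    # False True
```

If a DDE object ends up in such a cycle, its finalizer (and hence the temp-dir cleanup) is deferred the same way.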

mwappner commented 2 years ago

So you suggest that maybe

DDE.__del__()
del DDE
gc.collect()

can solve my issue, right?

Also, since 3e50de0, we are back to most temp files getting deleted, but some aren't. I solved that issue, as outlined before, by setting a custom temp directory target in a place of my choosing and then deleting that directory.

Wrzlprmft commented 2 years ago

So you suggest that maybe

DDE.__del__()
del DDE
gc.collect()

can solve my issue, right?

Yes and no. More precisely:

Also, since 3e50de0, we are back to most temp files getting deleted, but some aren't. I solved that issue, as outlined before, by setting a custom temp directory target in a place of my choosing and then deleting that directory.

Do all involved Python scripts exit regularly (without being killed by the system or you)? If yes, I would be interested in a script where no deletion happened, since I failed to reproduce this: the only way I could get the temporary directory to persist was to hard-kill Python – in which case I see no way of handling this beyond placing it in a designated place for temporary directories.

mwappner commented 2 years ago

Alright, small update: this didn't solve the issue, and at this point I have no clue what causes it. It's hard to reproduce and only arises in very long runs (meaning many short integrations, each with a slightly different set of parameters), so it's bothersome to debug.

It didn't solve the issue, but it did move it somewhere else along the chain: now I'm running out of memory because the processes don't finish. I don't understand where that memory is going, since jitcdde basically only uses whatever it needs to keep track of the current state, and my program always saves around 10k timesteps. Nor do I understand why the memory usage should be variable in a situation like this.

In any case, I ended up going for the worst solution ever, but one that at least lets me move forward with my work: I simply terminate the process if it takes more than a set amount of time and record which parameter set caused it, so I can later handle it however I want.

I really appreciate your dedication to solving my problem, and I think the changes you implemented are welcome regardless of whether they solve my issue.

Wrzlprmft commented 2 years ago

It didn't solve the issue, but it did move it somewhere else along the chain: now I'm running out of memory because the processes don't finish. I don't understand where that memory is going, since jitcdde basically only uses whatever it needs to keep track of the current state, and my program always saves around 10k timesteps. Nor do I understand why the memory usage should be variable in a situation like this.

Could it be that those processes never finished before, which is why you were stuck with the temporary directories?

Anyway, the memory usage of JiTCDDE is indeed inevitably variable, for the following reason: the entire past of the system up to the largest delay needs to be stored to perform the integration. How much memory this requires depends on the integration step size, which is in turn adaptive, depending on what is needed to achieve a predefined accuracy (see set_integration_parameters). If your system becomes excessively difficult to integrate for some parameter, the integration step can become very small and thus the required memory rather high. Usually, you quickly get an UnsuccessfulIntegration exception, but depending on why exactly this happens, this can take quite long or not happen at all. On top of that, in this case, the integration takes a long time to finish, which would match your observation. Thus, my best guess is that this is indeed what afflicts you.

Assuming that I am correct, I recommend first looking at one of the integrations that fail and finding out why. The simplest case would be that the dynamics are unbounded and escalate, which can easily be caught during the integration. If you want to avoid such long integrations and memory overloads in general, you can increase the min_step argument of set_integration_parameters, which will cause integrations requiring too much memory to fail more quickly with an UnsuccessfulIntegration exception.

mwappner commented 2 years ago

Thanks for the help; I learnt a bunch about how this library works, and, given my problem, I decided placing a hard timeout on the integration was the easiest solution.

Assuming that I am correct, I recommend first looking at one of the integrations that fail and finding out why.

Of course this was also my aim, but I couldn't get any failing integration to fail reliably. Moreover, it seemed to be hardware-dependent or something: I switched from one cluster to another and the thing stopped failing. I'm somewhat sure that the issue was not in the library anyway, but in my usage of it, so I just powered through.

To anyone interested, as I said, I ended up using a hard timeout following this answer to run processes (and not threads) for a given amount of time and abort them if they take too long.