quaquel / pyNetLogo

BSD 3-Clause "New" or "Revised" License

Experiments work on machine with older env, but not another with newer env #72

Closed steipatr closed 1 year ago

steipatr commented 1 year ago

I have some EMA+pynetlogo code that runs on my desktop in a somewhat older environment, but not on my laptop in a newer environment. Testing various example models, I found that the pynetlogo SALib demo (https://github.com/quaquel/pyNetLogo/blob/master/examples/predprey_salib_multiprocessing.py) works only on the desktop, while https://github.com/quaquel/EMAworkbench/blob/master/ema_workbench/examples/example_netlogo.py works on both.

A representative traceback from the laptop for the pynetlogo SALib demo:

(robustresilience39) C:\Local\Testing>python predprey_salib_multiprocessing.py
C:\Local\Testing\predprey_salib_multiprocessing.py:81: DeprecationWarning: `salib.sample.saltelli` will be removed in SALib 1.5. Please use `salib.sample.sobol`
  param_values = saltelli.sample(problem, n, calc_second_order=True)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Local\Testing\predprey_salib_multiprocessing.py", line 48, in run_simulation
    counts = netlogo.repeat_report(["count sheep", "count wolves"], 100)
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\site-packages\pynetlogo\core.py", line 557, in repeat_report
    os.remove(value)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Local\\Testing\\is21qc9b\\tmp273ew_1s.txt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Local\Testing\predprey_salib_multiprocessing.py", line 89, in <module>
    for entry in executor.map(run_simulation, experiments.to_dict("records")):
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
PermissionError: [Errno 13] The process cannot access the file because it is being used by another process: 'C:\\Local\\Testing\\is21qc9b\\tmp273ew_1s.txt'

My suspicion is that something related to working directories has changed between package versions, but I might be completely off. I looked at the changelogs for pynetlogo and EMA but didn't spot anything that seemed relevant. Does this problem ring a bell?

Desktop, Win10 Enterprise:

    ema-workbench    2.2.0    pypi_0              pypi
    multiprocess     0.70.13  pypi_0              pypi
    pynetlogo        0.4.2    pypi_0              pypi
    python           3.10.6   h9a09f29_0_cpython  conda-forge

Laptop, Win10 Enterprise:

    ema-workbench    2.3.0    pypi_0              pypi
    multiprocess     0.70.14  pypi_0              pypi
    pynetlogo        0.5.1    pypi_0              pypi
    python           3.9.16   h6244533_2

quaquel commented 1 year ago

Something did change in repeat_report in pynetlogo: I switched from files whose names were based on the outcome of interest to tempfiles. What I find strange is that different, separately spawned processes end up using the exact same tempdir and tempfile. If this is a problem in pynetlogo, it should be fixable with a prefix or suffix based on, e.g., the process id.
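
A rough sketch of that idea (not the actual repeat_report implementation, just an illustration of passing a pid-based suffix to tempfile.mkdtemp/mkstemp):

    import os
    import tempfile

    # hypothetical illustration: tag the temp dir and temp files with the
    # worker's process id so separately spawned processes cannot collide
    suffix = f"-{os.getpid()}"
    tempfolder = tempfile.mkdtemp(suffix=suffix, dir=os.getcwd())
    fh, fname = tempfile.mkstemp(suffix=suffix + ".txt", dir=tempfolder)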

In contrast, the ema workbench example relies on a custom multiprocessing evaluator, which explicitly sets up a separate working directory for each process. Moreover, the way the command string is built is a bit more involved.
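
The general pattern looks roughly like the sketch below (a generic illustration, not the workbench's actual evaluator code; the initializer and Pool setup are purely hypothetical): each worker process gets its own working directory via a pool initializer.

    import os
    import tempfile
    from multiprocessing import Pool

    def initializer(root):
        # illustrative only: give every worker process its own working
        # directory so file-based reporters cannot collide
        workdir = tempfile.mkdtemp(prefix=f"worker-{os.getpid()}-", dir=root)
        os.chdir(workdir)

    if __name__ == "__main__":
        root = os.getcwd()
        with Pool(4, initializer=initializer, initargs=(root,)) as pool:
            ...  # pool.map over the experiments would go here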

steipatr commented 1 year ago

First of all, I appreciate your support on all things EMA and pynetlogo. Thank you for your quick replies and insightful suggestions.

I tried adding the process ID (with os.getpid()) as a suffix to both mkdtemp and mkstemp in repeat_report to no avail:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "C:\Local\Testing\predprey_salib_multiprocessing.py", line 47, in run_simulation
    counts = netlogo.repeat_report(["count sheep", "count wolves"], 100)
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\site-packages\pynetlogo\core.py", line 553, in repeat_report
    os.remove(value)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Local\\Testing\\i8f_xyod-10180\\tmpme8gp6ok-10180.txt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Local\Testing\predprey_salib_multiprocessing.py", line 94, in <module>
    for entry in executor.map(run_simulation, experiments.to_dict("records")):
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 364, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\stein042\.conda\envs\robustresilience39\lib\multiprocessing\pool.py", line 771, in get
    raise self._value
PermissionError: [Errno 13] The process cannot access the file because it is being used by another process: 'C:\\Local\\Testing\\i8f_xyod-10180\\tmpme8gp6ok-10180.txt'

I also tried commenting out the temporary folder and file cleanup in repeat_report, after which the code ran fine, although I also had to change some output processing in https://github.com/quaquel/pyNetLogo/blob/master/examples/predprey_salib_multiprocessing.py#L50, from:

    results = pd.Series(
        [counts["count sheep"].values.mean(), counts["count wolves"].values.mean()],
        index=["Avg. sheep", "Avg. wolves"],
    )
    return results

to:

    results = pd.Series(
        [counts["count sheep"].mean(), counts["count wolves"].mean()],
        index=["Avg. sheep", "Avg. wolves"],
    )

due to the error "numpy.ndarray object has no attribute values".

So perhaps the issue is somewhere in the temporary folder/file cleanup? Might be as simple as an indentation error. Will dig around some more.

quaquel commented 1 year ago

OK, so this is clearly a pynetlogo rather than an EMA bug, and from the looks of it Windows-specific.

steipatr commented 1 year ago

Yes, I agree. I think that at the owner level it's possible to move issues between repos, so perhaps this should be continued in the pynetlogo repo.

quaquel commented 1 year ago

I learn something new every day.

steipatr commented 1 year ago

The plot thickens. File handles/descriptors are somehow being shared across processes. This might cause a mixup when os.close is called.

EDIT: Nevermind, the only thick thing here is me.

File handles (fhs) for Process 14868: [4, 5]
File names (fns) for Process 14868: {'count sheep': 'C:\\Local\\Testing\\aucn0836-14868\\tmp80m5l9mp-14868.txt', 'count wolves': 'C:\\Local\\Testing\\aucn0836-14868\\tmpy7p0_bjd-14868.txt'}
Process 14868 should now os.remove the filename C:\Local\Testing\aucn0836-14868\tmp80m5l9mp-14868.txt
Process 14868 should now os.remove the filename C:\Local\Testing\aucn0836-14868\tmpy7p0_bjd-14868.txt
Process 14868 should now os.close the file handle 4
Process 14868 should now os.close the file handle 5

File handles (fhs) for Process 9824: [4, 5]
File names (fns) for Process 9824: {'count sheep': 'C:\\Local\\Testing\\bbjm843t-9824\\tmpt0wohpp1-9824.txt', 'count wolves': 'C:\\Local\\Testing\\bbjm843t-9824\\tmpu4dsoax0-9824.txt'}
Process 9824 should now os.remove the filename C:\Local\Testing\bbjm843t-9824\tmpt0wohpp1-9824.txt
Process 9824 should now os.remove the filename C:\Local\Testing\bbjm843t-9824\tmpu4dsoax0-9824.txt
Process 9824 should now os.close the file handle 4
Process 9824 should now os.close the file handle 5

Side note, it's apparently normal for the file handles to start at a low number like 4; descriptors 0-2 are reserved for standard input, output, and error, and each process has its own descriptor table, so the same numbers in different processes refer to different files.
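
A minimal way to see this (an illustrative sketch, not taken from repeat_report): each worker creates a temp file and reports the same low descriptor numbers even though the paths differ, because the descriptor table is per-process.

    import os
    import tempfile
    from multiprocessing import Pool

    def make_temp(_):
        # the descriptor number is only meaningful inside this process
        fh, fname = tempfile.mkstemp(suffix=".txt")
        os.close(fh)
        os.remove(fname)
        return os.getpid(), fh, fname

    if __name__ == "__main__":
        with Pool(2) as pool:
            for pid, fh, fname in pool.map(make_temp, range(4)):
                print(f"pid={pid} fd={fh} file={fname}")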

steipatr commented 1 year ago

Re-ordering the cleanup code in repeat_report seems to have fixed the problem. I still need to verify that this doesn't screw with the outputs somehow, though; I'm somewhat concerned about that based on the original error messages.

        for key, value in fns.items():
            [...]
            os.remove(value)

        # cleanup temp files and folders
        for fh in fhs:
            os.close(fh)
        os.rmdir(tempfolder)

becomes:

        # cleanup temp files and folders
        for fh in fhs:
            os.close(fh)  # free up the file handle for re-use

        for key, value in fns.items():
            os.remove(value)  # delete the file by name

        os.rmdir(tempfolder)
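
The reason the order matters can be reproduced in isolation (a minimal sketch of the Windows behavior, separate from the actual repeat_report code): mkstemp returns an already-open descriptor, and on Windows os.remove fails with WinError 32 for as long as that descriptor stays open.

    import os
    import tempfile

    fh, fname = tempfile.mkstemp(suffix=".txt")

    try:
        os.remove(fname)  # on Windows: PermissionError, the handle is still open
        print("removed while open (POSIX allows this)")
        os.close(fh)      # on POSIX the file is already gone; just close the fd
    except PermissionError as err:
        print(err)
        os.close(fh)      # release the handle first...
        os.remove(fname)  # ...then the delete succeeds
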
quaquel commented 1 year ago

Strangely, you need to remove the files before removing the entire directory.

I am wondering whether the shift to tempfiles was such a good idea, given that we already have a tempfolder.

steipatr commented 1 year ago

Yes, I noticed that this is a stated limitation of rmdir. There is apparently https://docs.python.org/3/library/shutil.html#shutil.rmtree, which removes a directory together with its contents.
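
For comparison, a sketch of the cleanup using that approach (assuming shutil is imported in core.py and the same fhs/tempfolder variables as above; untested):

        # cleanup temp files and folders
        for fh in fhs:
            os.close(fh)  # an open handle would also block rmtree on Windows

        shutil.rmtree(tempfolder)  # removes the folder together with its files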

steipatr commented 1 year ago

Side note, the results look untouched as far as I can tell, so that's good. I can submit a pull request; do you prefer the reordering solution or switching to shutil.rmtree? Either way, since this is OS-level file interaction, it should probably also be tested on Unix, which I can't do.

quaquel commented 1 year ago

I am OK with either solution, although I slightly prefer shutil.rmtree.

I can test under macOS, and based on experience so far, if that works, Linux works as well.