openai / gym

A toolkit for developing and comparing reinforcement learning algorithms.
https://www.gymlibrary.dev

multiprocessing and monitor interaction #636

Closed alexander-turner closed 7 years ago

alexander-turner commented 7 years ago

I recently implemented multiprocessing for running episodes (since I'm testing with non-learning bandits). It works fine when no monitors are initialized, but I get an error when they are. The episode video files are written fine, but the manifest isn't:

Exception ignored in: <bound method Monitor.__del__ of <Monitor<TimeLimit<FrozenLakeEnv<FrozenLake-v0>>>>>
Traceback (most recent call last):
  File "C:\Users\Alex\OneDrive\Documents\Classes\OSU\Research\PyPlan\WinPython\python-3.6.1.amd64\lib\site-packages\gym\wrappers\monitoring.py", line 244, in __del__
    self.close()
  File "C:\Users\Alex\OneDrive\Documents\Classes\OSU\Research\PyPlan\WinPython\python-3.6.1.amd64\lib\site-packages\gym\wrappers\monitoring.py", line 153, in close
    self._flush(force=True)
  File "C:\Users\Alex\OneDrive\Documents\Classes\OSU\Research\PyPlan\WinPython\python-3.6.1.amd64\lib\site-packages\gym\wrappers\monitoring.py", line 144, in _flush
    }, f, default=json_encode_np)
  File "C:\Users\Alex\OneDrive\Documents\Classes\OSU\Research\PyPlan\WinPython\python-3.6.1.amd64\lib\contextlib.py", line 89, in __exit__
    next(self.gen)
  File "C:\Users\Alex\OneDrive\Documents\Classes\OSU\Research\PyPlan\WinPython\python-3.6.1.amd64\lib\site-packages\gym\utils\atomic_write.py", line 50, in atomic_write
    replace(tmppath, filepath)
PermissionError: [WinError 5] Access denied: (writing temporary manifest file to actual manifest file)

The code is here.

olegklimov commented 7 years ago

WinError means you are on Windows? Windows locks files when they are written to, and in some other cases.

With any kind of multiprocessing, it's a good idea to write to different files. The second parameter of the Monitor constructor is there for this purpose.
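Roughly like this, for example (an untested sketch using the gym.wrappers.Monitor and FrozenLake-v0 from your traceback; worker_id just stands for whatever index you hand each process):

import gym
from gym.wrappers import Monitor

worker_id = 0  # in practice: the index the spawning code hands this process
env = gym.make("FrozenLake-v0")
# one directory per process, so no two processes ever write to the same manifest
env = Monitor(env, directory="records-{}".format(worker_id), resume=True)
env.reset()
env.close()  # with the default settings, closing the monitor is what writes the manifest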

alexander-turner commented 7 years ago

Which parameter would that be - uid? These are the ones I see. I currently have force disabled and resume enabled.

directory (str): A per-training run directory where to record stats.
video_callable (Optional[function, False]): function that takes in the index of the episode and
    outputs a boolean, indicating whether we should record a video on this episode.
    The default (for video_callable is None) is to take perfect cubes, capped at 1000.
    False disables video recording.
force (bool): Clear out existing training data from this directory
     (by deleting every file prefixed with "openaigym.").
resume (bool): Retain the training data already in this directory, which will be merged with our new data
write_upon_reset (bool): Write the manifest file on each reset. (This is currently a JSON file,
     so writing it is somewhat expensive.)
uid (Optional[str]): A unique id used as part of the suffix for the file. By default, uses os.getpid().
mode (['evaluation', 'training']): Whether this is an evaluation or training episode.
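If it's uid, I assume passing it would look roughly like this (untested sketch; the directory name and worker index are placeholders):

import gym
from gym.wrappers import Monitor

worker_index = 3  # placeholder for whatever the spawning code hands this process
env = gym.make("FrozenLake-v0")
# every process shares the directory, but each gets its own suffix on the recorded files
env = Monitor(env, "records", resume=True, uid="worker-{}".format(worker_index))
env.close()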
olegklimov commented 7 years ago

Oops, I was looking at a different Monitor, sorry. As a general suggestion, try disabling the manifest write for every process except one. If this is a problem others are likely to hit, let's fix it for everyone.

alexander-turner commented 7 years ago

I'm not quite sure how to disable writing for all but one, but I did enable write_upon_reset and that fixed the access error. However, the episode_batch file still leaves much to be desired:

{"initial_reset_timestamp": 1498670304.8532367, "timestamps": [], "episode_lengths": [], "episode_rewards": [], "episode_types": ["t"]}

Accordingly, this error appears:

gym.error.InvalidRequestError: Request req_JN9PvTSQ3GqpxZ41PHm5A: Must provide a training episode batch.

Edit: For unrelated reasons, I updated the interface so it uses multiprocessing.Pool; the issue remains, however.
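For context, the driving code is now shaped roughly like this (a simplified, self-contained sketch rather than my actual code: random actions stand in for the agent, and the directory and uid names are placeholders):

import multiprocessing

import gym
from gym.wrappers import Monitor


def run_one_episode(episode_id):
    # each call records into the shared directory under its own uid,
    # so the worker processes never collide on file names
    env = gym.make("FrozenLake-v0")
    env = Monitor(env, "records", resume=True, uid="episode-{}".format(episode_id))
    env.reset()
    done = False
    length, total_reward = 0, 0.0
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        length += 1
        total_reward += reward
    env.close()
    return length, total_reward


if __name__ == "__main__":  # the guard matters on Windows, where Pool re-imports this module in each child
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(run_one_episode, range(20))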

alexander-turner commented 7 years ago

Fixed the issue - the problem was with how I was closing the environments and keeping track of the monitors.

alexander-turner commented 7 years ago

I solved another issue, but the underlying problem is unfortunately still there. Runs now complete and save their data, but the correct information still isn't written to the episode_batch file, so I can't upload anything (even though all the videos and other metadata are present).

alexander-turner commented 7 years ago

Figured this out more quickly than I thought I would. It turns out that using multiprocessing like this causes the stats_recorder to flush its output far too early (generally before it has recorded anything). I fixed this by having each process in the pool pass back its local episode information, manually building the lists the batch file expects on the env.stats_recorder object, flushing, and then uploading.
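In sketch form, the parent-side reconstruction looks something like this (simplified and untested: the hard-coded results stand in for what pool.map returns, the attribute names are the ones visible on env.stats_recorder and in the episode_batch JSON above, and the gym.upload call and API key are placeholders for the usual scoreboard upload):

import time

import gym
from gym.wrappers import Monitor

# In the real code this comes back from pool.map(); hard-coded here so the
# sketch stands alone. Each tuple is (episode_length, episode_reward).
results = [(13, 1.0), (27, 0.0), (9, 1.0)]

env = Monitor(gym.make("FrozenLake-v0"), "records", resume=True, uid="parent")

# This process never ran the episodes itself, so rebuild the lists the
# batch file expects from what the workers reported back.
env.stats_recorder.initial_reset_timestamp = time.time()
for length, reward in results:
    env.stats_recorder.episode_lengths.append(length)
    env.stats_recorder.episode_rewards.append(reward)
    env.stats_recorder.episode_types.append("t")
    env.stats_recorder.timestamps.append(time.time())

env.stats_recorder.flush()   # write the episode_batch file
env.close()                  # write the manifest

gym.upload("records", api_key="YOUR_API_KEY")  # placeholder API key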