minerllabs / minerl

MineRL Competition for Sample Efficient Reinforcement Learning - Python Package
http://minerl.io/docs/

batch_iter memory leak #356

Closed JACKHAHA363 closed 4 years ago

JACKHAHA363 commented 4 years ago
import minerl

if __name__ == '__main__':
    data = minerl.data.make('MineRLTreechopVectorObf-v0')
    batch_iter = data.batch_iter(6, 50, num_epochs=-1)
    for idx, batch in enumerate(batch_iter):
        pass

This script produces a memory leak with minerl==0.3.6 on Python 3.7.

JACKHAHA363 commented 4 years ago

Using python3.6 seems to fix it.

vincent-thevenin commented 4 years ago
import minerl
import gym

def test():
    data = minerl.data.make('MineRLTreechopVectorObf-v0')
    for i in range(10):
    for current_state, action, reward, next_state, done \
                    in data.batch_iter(batch_size=1, num_epochs=1, seq_len=2, preload_buffer_size=1):
            print(reward)
            break
        print(i)

test()

I have the same issue with minerl==0.3.6. Downgrading to Python 3.6 helps somewhat, but the code above still leaks memory on Python 3.6; there seems to be some problem with closing the worker processes.

The outer loop slows down at every iteration until the script crashes.

MichalOp commented 4 years ago

It seems that the problem lies in minerl.data.util.OrderedJobStreamer.

For some reason it does this:

results = queue.Queue()
# Enqueue jobs
for arg in tqdm.tqdm(self.job_args):
    results.put(ex.submit(self.job, arg))

# Dequeue results and push them to the output queue.
while not results.empty() and not self._should_exit:
    future = results.get()
    if future.exception():
        raise future.exception()
    res = future.result()

    while not self._should_exit:
        try:
            self.output_queue.put(res, block=True, timeout=1)
            break
        except queue.Full:
            pass

That means it starts executing loading jobs for all replays in the directory in the background before any loaded data is consumed, and if the consumer fails to keep up, it may well load all of them into memory before they can be used.
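A bounded variant is easy to sketch. The following is a hypothetical illustration (with a dummy `job` standing in for the replay loader, and `bounded_job_streamer` a made-up name, not minerl code): a semaphore caps how many jobs may be in flight, so new loads are only submitted as earlier results are handed off.

```python
import concurrent.futures
import queue
import threading

def bounded_job_streamer(job, job_args, output_queue, max_in_flight=4):
    """Stream job results into output_queue while keeping at most
    max_in_flight jobs submitted at any time, instead of enqueueing
    every job up front."""
    slots = threading.Semaphore(max_in_flight)
    futures = queue.Queue()

    def submit_all():
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
            for arg in job_args:
                slots.acquire()            # blocks once max_in_flight pending
                futures.put(ex.submit(job, arg))
        futures.put(None)                  # sentinel: no more jobs

    threading.Thread(target=submit_all, daemon=True).start()
    while True:
        fut = futures.get()
        if fut is None:
            break
        output_queue.put(fut.result())     # result handed off to consumer...
        slots.release()                    # ...so one more job may be submitted
```

Results still arrive in submission order, but memory is bounded by `max_in_flight` pending jobs plus whatever sits in `output_queue`.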

In @vincent-thevenin's case it starts 10 separate loading processes that load all the data into memory 10 times, while the data they load is never used, causing massive memory usage.

Side note: the whole loading system seems rather simplistic. If I call batch_iter with batch size 10, I would want it to iterate over 10 separate replays cut into sequences, so that when training an RNN agent I can reuse the network state from the previous iteration and get diverse samples within a batch, instead of getting a batch of 10 sequences cut from the same replay.
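For illustration, that kind of diverse batching could look like the sketch below, with plain Python iterators standing in for per-replay sequence loaders (the function name is made up, not minerl API): each batch slot is pinned to one replay, and an exhausted replay is swapped for a fresh one.

```python
import itertools

def diverse_batches(replay_iterators, batch_size):
    """Yield batches in which every sequence comes from a different
    replay; exhausted replays are swapped for fresh ones.  An RNN's
    hidden state stays valid per batch slot for as long as the same
    replay occupies that slot."""
    active = list(itertools.islice(replay_iterators, batch_size))
    while active:
        batch, next_active = [], []
        for it in active:
            try:
                batch.append(next(it))
                next_active.append(it)
            except StopIteration:
                try:  # this replay ran out: start a new one if any remain
                    new_it = next(replay_iterators)
                    batch.append(next(new_it))
                    next_active.append(new_it)
                except StopIteration:
                    pass
        active = next_active
        if batch:
            yield batch
```

Batches shrink only once the pool of remaining replays runs dry, so almost every batch contains `batch_size` sequences from distinct replays.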

NikEyX commented 4 years ago

I have the same problem (mentioned it on the discourse a week ago: https://discourse.aicrowd.com/t/k-means-example-code-crashing/3307/2) - is there a workaround for that other than reverting to py3.6?

MadcowD commented 4 years ago

We should fix this. Thanks for noticing -_-

marthinwurer commented 4 years ago

I'm seeing similar stuff on python 3.8.5.

BATCH_SIZE = 16
data = minerl.data.make(MINERL_GYM_ENV, data_dir=MINERL_DATA_ROOT, num_workers=1, worker_batch_size=BATCH_SIZE)
for batch_data in data.batch_iter(batch_size=BATCH_SIZE, num_epochs=1, seq_len=1, preload_buffer_size=1):
    break

Takes at least 15 seconds to run, then throws an OSError:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/concurrent/futures/process.py", line 102, in _python_exit
    thread_wakeup.wakeup()
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/concurrent/futures/process.py", line 90, in wakeup
    self._writer.send_bytes(b"")
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/connection.py", line 183, in send_bytes
    self._check_closed()
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed

There's definitely a memory leak in the background: if I manually create the iterator and call next on it, it fills up my memory in the background and crashes everything. It's hard to do data exploration like this.

MichalOp commented 4 years ago

There seem to be two separate issues:

The first is caused by workers that try to load all the data in the background. I added a fix that prevents worker processes from loading more replays than preload_buffer_size, so with this fix, and on Python 3.6, it should be possible to use batch_iter as intended. It should also be possible to manually create the iterator as @marthinwurer wanted, although at the end of an epoch there will still be problems related to the second issue on Python >= 3.7. (You also shouldn't break out of the loop, as I believe the workers won't be cleaned up if the loop doesn't finish normally.)

The second issue seems to be that on Python >= 3.7 the workers aren't stopped correctly after an epoch finishes. It is better described in #349, so it should probably be investigated there. Unlike the first issue, this is a true memory leak, but it is much less apparent: the workers don't use that much memory, and if your epochs take relatively long you can probably ignore it for a while.

MichalOp commented 4 years ago

I changed my fix a bit to use plain multiprocessing.Process instead of concurrent.futures.ProcessPoolExecutor. It should now work on Python >= 3.7 too, and as a side effect it actually uses the number of workers passed as an argument to DataPipeline.
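The shape of that change can be sketched with stdlib pieces only (a dummy job stands in for replay loading; the function names are made up, not the actual DataPipeline code): each worker is a plain multiprocessing.Process that exits on a None sentinel, so no ProcessPoolExecutor atexit machinery is involved and the worker count is explicit.

```python
import multiprocessing as mp

def worker(job_queue, result_queue):
    """Consume jobs until a None sentinel arrives, then exit cleanly."""
    while True:
        job = job_queue.get()
        if job is None:
            break
        result_queue.put(job * 2)  # stand-in for loading one replay

def run_jobs(jobs, num_workers=2):
    job_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(job_q, result_q))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for j in jobs:
        job_q.put(j)
    for _ in procs:
        job_q.put(None)            # one sentinel per worker
    results = [result_q.get() for _ in jobs]
    for p in procs:
        p.join()                   # workers exit on their own sentinels
    return results
```

Note results arrive in completion order, not submission order, and on spawn-based platforms (macOS, Windows) the calling code needs the usual `if __name__ == '__main__':` guard.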

decodyng commented 4 years ago

(You also shouldn't break the loop as the workers won't be cleaned up if the loop doesn't finish normally, I think.)

I've been reading through this issue and trying to get a handle on what the problems here imply. Given this quoted section, is it a correct interpretation that you can't currently cleanly pull out data from only a subset of an epoch? (Doing so would require either modifying _get_all_valid_recordings, which means modifying MineRL itself, or breaking in the middle of a full-epoch batch_iter, in which case, as mentioned here, the workers might not be cleaned up properly.)

If that is a correct interpretation, I might create a separate issue to discuss this as a valuable feature to implement (particularly for testing situations where you may not need or want to load a full epoch), but I first wanted to check the intuitions of the folks on this thread who have done more digging into this part of the code than I have.

MichalOp commented 4 years ago

Given this quoted section, is it a correct interpretation that you can't currently cleanly only pull out data from a subset of an epoch?

That seems to be correct.

Since doing so would either require modifying _get_all_valid_recordings, which would require modification to MineRL itself, or breaking in the middle of a full-epoch batch_iter, since, as mentioned here, the workers might not be cleaned up properly in that situation.

I think both of those things should actually be made easier. For example, batch_iter could accept a list of replays (or folders containing replays) as input, so that you can decide exactly which replays to load; and minerl.data.make could launch in single-threaded mode without multiprocessing when you specify workers=0, which would make it safe to break out of loops.
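As a sketch of the first suggestion (hypothetical names, not the existing minerl API): a batch iterator that takes an explicit list of replay names and a loader callable, running entirely in the calling process, so stopping early never leaves workers behind.

```python
def batch_iter_from(replay_names, load_replay, batch_size):
    """Hypothetical single-process batch_iter over an explicit replay
    list.  No worker processes are spawned, so a consumer can break
    out of the loop safely at any point."""
    batch = []
    for name in replay_names:
        for sample in load_replay(name):   # loader yields samples of a replay
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:                              # final partial batch
        yield batch
```

Since this is a plain generator, `break`ing out simply abandons it; there is no cleanup to worry about beyond whatever file handles the loader itself holds.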