JACKHAHA363 closed this issue 4 years ago
Using python3.6 seems to fix it.
import minerl
import gym

def test():
    data = minerl.data.make('MineRLTreechopVectorObf-v0')
    for i in range(10):
        for current_state, action, reward, next_state, done \
                in data.batch_iter(batch_size=1, num_epochs=1, seq_len=2, preload_buffer_size=1):
            print(reward)
            break
        print(i)

test()
I have the same issue with minerl==0.3.6. Reverting to python3.6 helps somewhat, but the code above still leaks memory under python3.6: there is some problem with closing the worker processes. The outer loop slows down at every iteration until it crashes.
It seems that the problem lies in minerl.data.util.OrderedJobStreamer. For some reason it does this:
results = queue.Queue()

# Enqueue jobs
for arg in tqdm.tqdm(self.job_args):
    results.put(ex.submit(self.job, arg))

# Dequeue jobs and push them to a queue.
while not results.empty() and not self._should_exit:
    future = results.get()
    if future.exception():
        raise future.exception()
    res = future.result()
    while not self._should_exit:
        try:
            self.output_queue.put(res, block=True, timeout=1)
            break
        except queue.Full:
            pass
That means it starts executing loading jobs for every replay in the directory in the background before the consumer even starts to use the loaded data; if the consumer fails to consume fast enough, the workers may well load everything before any of it is used. In @vincent-thevenin's case it starts 10 separate loading processes that load all the data into memory 10 times without ever using it, causing massive memory usage.
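A minimal sketch of what bounded submission could look like instead: only submit a new loading job after a previous result has been handed to the consumer, so at most preload_buffer_size jobs are ever in flight. This is a hypothetical rewrite, not MineRL's actual code; a thread pool stands in for the library's process pool, and `job`/`job_args`/`output_queue` mirror the attributes in the snippet above.

```python
import concurrent.futures
import queue


def bounded_job_streamer(job, job_args, output_queue, preload_buffer_size=4):
    """Keep at most preload_buffer_size loading jobs in flight at once."""
    pending = queue.Queue()
    with concurrent.futures.ThreadPoolExecutor() as ex:
        args_iter = iter(job_args)
        # Prime the pipeline with a bounded number of jobs instead of all of them.
        for _ in range(preload_buffer_size):
            arg = next(args_iter, None)
            if arg is None:
                break
            pending.put(ex.submit(job, arg))
        while not pending.empty():
            future = pending.get()
            if future.exception():
                raise future.exception()
            # Blocks until the consumer makes room, which throttles loading.
            output_queue.put(future.result(), block=True)
            # Only now submit the next job, preserving the buffer bound.
            arg = next(args_iter, None)
            if arg is not None:
                pending.put(ex.submit(job, arg))
```

With a bounded output_queue, the blocking put provides the backpressure that the original implementation lacks.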
Side note: the whole loading system seems rather simplistic. If I call batch_iter with batch size 10, I would want it to iterate over 10 separate replays cut into sequences, so that when training an RNN agent I can carry network state over from the previous iteration and get reasonably diverse samples in each batch, instead of getting a batch of 10 sequences cut from the same replay.
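The batching scheme suggested above could look something like this sketch. `interleaved_batches` is a hypothetical helper (not part of MineRL), where batch slot i always draws consecutive sequences from its own replay, so an RNN's hidden state for that slot can be carried between iterations:

```python
def interleaved_batches(replays, seq_len, batch_size):
    """Yield batches where each slot follows one replay; a replay here is
    just a list of timesteps (a stand-in for real MineRL trajectories)."""
    def chunks(replay):
        # Cut a replay into consecutive non-overlapping seq_len windows.
        for start in range(0, len(replay) - seq_len + 1, seq_len):
            yield replay[start:start + seq_len]

    streams = [chunks(r) for r in replays[:batch_size]]
    spare = iter(replays[batch_size:])
    while True:
        batch = []
        for i, stream in enumerate(streams):
            chunk = next(stream, None)
            if chunk is None:
                # Slot's replay is exhausted: start a fresh replay in this
                # slot (the RNN state for this slot should then be reset).
                nxt = next(spare, None)
                if nxt is None:
                    return
                streams[i] = chunks(nxt)
                chunk = next(streams[i], None)
                if chunk is None:
                    return
            batch.append(chunk)
        yield batch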
I have the same problem (mentioned it on the discourse a week ago: https://discourse.aicrowd.com/t/k-means-example-code-crashing/3307/2) - is there a workaround for that other than reverting to py3.6?
We should fix this. Thanks for noticing -_-
I'm seeing similar stuff on python 3.8.5.
BATCH_SIZE = 16
data = minerl.data.make(MINERL_GYM_ENV, data_dir=MINERL_DATA_ROOT, num_workers=1, worker_batch_size=BATCH_SIZE)
for batch_data in data.batch_iter(batch_size=BATCH_SIZE, num_epochs=1, seq_len=1, preload_buffer_size=1):
    break
Takes at least 15 seconds to run, then throws an OSError:
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/concurrent/futures/process.py", line 102, in _python_exit
    thread_wakeup.wakeup()
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/concurrent/futures/process.py", line 90, in wakeup
    self._writer.send_bytes(b"")
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/connection.py", line 183, in send_bytes
    self._check_closed()
  File "/home/user/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/connection.py", line 136, in _check_closed
    raise OSError("handle is closed")
OSError: handle is closed
There's definitely a memory leak in the background: if I manually make the iterator and grab the next batch, it fills my memory up in the background and crashes everything. It's hard to do data exploration like this.
There seem to be two separate issues:
The first is caused by workers that try to load all data in the background. I added a fix that prevents worker processes from loading more replays than preload_buffer_size, so with this fix and python 3.6 it should be possible to use batch_iter as intended. It should also be possible to manually make the iterator as @marthinwurer wanted, although at the end of the epoch there will still be problems related to the second issue when using python >= 3.7.
(You also shouldn't break the loop as the workers won't be cleaned up if the loop doesn't finish normally, I think.)
The second issue seems to be caused by the fact that in python >= 3.7 workers aren't stopped correctly after the epoch finishes. It is better described in #349, so I think maybe it should be investigated there. Unlike the first issue, this causes a true memory leak, but it is definitely less apparent: the workers don't use that much memory, and if your epochs take relatively long you can probably ignore it for a while.
I changed my fix a bit to use basic multiprocessing.Process instead of concurrent.futures.ProcessPoolExecutor. It should now work for python >= 3.7 too, and it has the side effect of actually using the number of workers provided as an argument to DataPipeline.
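The shape of that fix might look roughly like this sketch (hypothetical names; a string placeholder stands in for real replay loading): plain multiprocessing.Process workers fed from a task queue and shut down explicitly with sentinels and join(), so nothing is left for the concurrent.futures atexit hook in python >= 3.7 to trip over.

```python
import multiprocessing as mp


def _worker(task_queue, out_queue):
    # Drain tasks until a None sentinel arrives, then exit cleanly.
    for name in iter(task_queue.get, None):
        out_queue.put("loaded:" + name)  # placeholder for real replay loading


def run_epoch(replay_names, num_workers=2):
    task_queue = mp.Queue()
    out_queue = mp.Queue()
    # Actually honor num_workers, unlike the old executor-based code path.
    workers = [mp.Process(target=_worker, args=(task_queue, out_queue))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for name in replay_names:
        task_queue.put(name)
    results = [out_queue.get() for _ in replay_names]
    # Explicit shutdown: one sentinel per worker, then join them all,
    # instead of relying on interpreter exit handlers.
    for _ in workers:
        task_queue.put(None)
    for w in workers:
        w.join()
    return results
```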
(You also shouldn't break the loop as the workers won't be cleaned up if the loop doesn't finish normally, I think.)
I've been reading through this issue and trying to get a handle on what the problems here imply. Given this quoted section, is it a correct interpretation that you can't currently cleanly only pull out data from a subset of an epoch? (Since doing so would either require modifying _get_all_valid_recordings, which would require modification to MineRL itself, or breaking in the middle of a full-epoch batch_iter, since, as mentioned here, the workers might not be cleaned up properly in that situation.)
If that is a correct interpretation, I might create a separate issue to discuss this as a valuable feature for the code to implement (particularly for testing situations where you may not need or want to load a full epoch), but I wanted to first check the intuitions of the folks on this thread who have done more digging into this part of the code than I have.
Given this quoted section, is it a correct interpretation that you can't currently cleanly only pull out data from a subset of an epoch?
That seems to be correct.
Since doing so would either require modifying _get_all_valid_recordings, which would require modification to MineRL itself, or breaking in the middle of a full-epoch batch_iter, since, as mentioned here, the workers might not be cleaned up properly in that situation.
I think actually both of those things should be made easier. For example, batch_iter could accept a list of replays or folders with replays as input, so that you can simply decide which specific replays you want to load, and it could also launch in a single-threaded mode without multiprocessing when you specify workers=0 in minerl.data.make; that should make it safe to break out of loops.
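A workers=0 mode could be as simple as a plain generator running in the calling process. This is a hypothetical sketch (load_replay and batch_iter_single are illustrative names, not MineRL APIs):

```python
def batch_iter_single(load_replay, replay_paths, batch_size):
    """Load replays lazily in the calling process, no worker processes."""
    batch = []
    for path in replay_paths:
        batch.append(load_replay(path))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch
```

Because nothing runs in the background, abandoning the generator with break simply stops further loading; there are no processes left to clean up.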
This script produces a memory leak with minerl==0.3.6 and python3.7.