Closed carbonox-infernox closed 5 years ago
Update: I obtained an instance with 244 GB of RAM and ran training again with 0 epochs to skip to this part:
```
steps >>> training finished
Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "/ebs/osmc/src/pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "/ebs/osmc/src/pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 3 more times]
  File "/ebs/osmc/src/steps/base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 263, in fit_transform
    return self.transform(*args, **kwargs)
  File "/ebs/osmc/src/models.py", line 95, in transform
    outputs = self._transform(datagen, validation_datagen)
  File "/ebs/osmc/src/steps/pytorch/models.py", line 145, in _transform
    outputs = {'{}_prediction'.format(name): np.vstack(outputs_) for name, outputs_ in outputs.items()}
  File "/ebs/osmc/src/steps/pytorch/models.py", line 145, in <dictcomp>
    outputs = {'{}_prediction'.format(name): np.vstack(outputs_) for name, outputs_ in outputs.items()}
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
```
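For what it's worth, here is a minimal sketch (toy shapes, not the real pipeline) of the pattern in `_transform` that appears to blow up: every batch's prediction array is accumulated in a list, and then `np.vstack` allocates one more contiguous full-size copy, so peak memory is roughly twice the size of the predictions for the whole dataset.

```python
import numpy as np

def predict_all(datagen):
    """Pattern from the traceback: collect every batch's output,
    then vstack them into one array covering the whole dataset."""
    outputs = []
    for batch in datagen:
        outputs.append(batch)  # per-batch prediction arrays pile up in RAM
    # vstack allocates a second, full-size contiguous copy on top of the list
    return np.vstack(outputs)

# toy stand-in datagen: 10 "batches" of 4 predictions each (shapes made up)
batches = [np.ones((4, 3), dtype=np.float32) for _ in range(10)]
preds = predict_all(iter(batches))
print(preds.shape)  # (40, 3)
```

With real mask predictions over 280,731 images, both the list and the stacked copy have to fit in RAM at once, which matches the slow climb to 100% usage.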
Hi @carbonox-infernox. I think I know what's happening. After training finishes, the pipeline we currently have runs a transform over the entire train dataset. On our side we never trained all the way to the end; we simply used/copied checkpoints, so we never hit this problem.
Your trained model is saved as training_experiment/checkpoints/unet/best.torch.
You can copy it to evaluation_experiment/transformers/unet
and then run evaluation/prediction in chunks.
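That copy step, as a runnable sketch (a dummy checkpoint is created here only so the snippet runs end to end; in a real run best.torch is written by training, and the experiment directory names are whatever you configured):

```python
import shutil
from pathlib import Path

src = Path('training_experiment/checkpoints/unet/best.torch')
dst = Path('evaluation_experiment/transformers/unet')

# stand-in checkpoint so this sketch is self-contained;
# in practice best.torch already exists after training
src.parent.mkdir(parents=True, exist_ok=True)
src.write_bytes(b'dummy checkpoint')

# the evaluation pipeline expects the transformer file at this path
dst.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(src, dst)
print(dst.exists())  # True
```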
Note that we implemented evaluation/prediction in chunks precisely to avoid this memory issue, but we didn't do the same for training, because usually you just keep training and use your best checkpoint.
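The chunked idea can be sketched roughly like this (hypothetical helper names, not the actual pipeline code): predict one slice at a time and write each slice into a disk-backed array, so peak RAM is bounded by a single chunk instead of the whole dataset.

```python
import numpy as np

def predict_in_chunks(model_fn, n_samples, n_classes, out_path, chunk=1000):
    """Write predictions chunk by chunk into a disk-backed .npy file,
    so only one chunk of predictions is ever held in RAM."""
    out = np.lib.format.open_memmap(
        out_path, mode='w+', dtype=np.float32, shape=(n_samples, n_classes))
    for start in range(0, n_samples, chunk):
        stop = min(start + chunk, n_samples)
        out[start:stop] = model_fn(start, stop)  # only this slice in memory
    out.flush()
    return out_path

# toy model_fn standing in for the real network's forward pass
path = predict_in_chunks(
    lambda a, b: np.zeros((b - a, 3), dtype=np.float32),
    n_samples=2500, n_classes=3, out_path='preds.npy', chunk=1000)
print(np.load(path, mmap_mode='r').shape)  # (2500, 3)
```

Loading the result with `mmap_mode='r'` afterwards keeps downstream steps from pulling the full array back into memory.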
@jakubczakon Thanks! That makes a lot of sense now.
Is there something I can put in the code to make it stop right after it creates that last best.torch checkpoint? What is the reason for this post-training step? Was it just for scoring your model for the competition?
Also, I've been meaning to ask about this: I've seen in 2 or 3 places in the issues that you were going to train with a sample of only 50,000 training images per epoch. Is this what you ended up doing?
How many epochs would I need to train (before the accuracy plateaus) with 50,000 images, as opposed to the number of epochs for the full 280,731-image dataset?
I'm using a machine with 122 GB of RAM, 1 GPU, a batch size of 20, and 4 workers. After training completes and I reach:
`steps >>> training finished`
there is no output for a while, so I keep an eye on system usage with the top command. What happens is that one of the python processes slowly increases its RAM usage (over the course of 60 minutes) until it consumes nearly 100% of the RAM and the process is killed.
I may be able to access a system with more RAM, but how do I know how much I need? 122 GB seems like it should have been enough.
Can I edit any parameters to make it get by with less RAM?
Edit: Went back through my logs to find the actual error:
```
RuntimeError: DataLoader worker (pid 16829) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
```
I have actually tried with num_workers=0 and I got the same error.