neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License

Enormous RAM usage after steps >>> training finished #191

Closed: carbonox-infernox closed this issue 5 years ago

carbonox-infernox commented 5 years ago

I'm using a machine with 122 GB of RAM and 1 GPU, with a batch size of 20 and 4 workers. After training completes and the log reaches:

`steps >>> training finished`

there is no output for a while, so I keep an eye on system usage with the `top` command. One of the python processes slowly increases its RAM usage over the course of about 60 minutes until it consumes nearly 100% of the RAM and the process is killed.

I may be able to access a system with more RAM, but how do I know how much I need? 122 GB seems like it should have been enough.

Can I edit any parameters to make it get by with less RAM?

Edit: I went back through my logs to find the actual error:

`RuntimeError: DataLoader worker (pid 16829) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.`

I have actually tried with num_workers=0 and got the same error.
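
For reference, here is a minimal way to log that memory growth instead of watching `top`. This is only a sketch, assuming `psutil` is installed and that you have already identified the PID of the growing python process (e.g. from `top`):

```python
import time

import psutil  # pip install psutil


def log_rss(pid, interval=30):
    """Print the resident set size of a process every `interval` seconds."""
    proc = psutil.Process(pid)
    while proc.is_running():
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print('{}  {:.1f} GB'.format(time.strftime('%H:%M:%S'), rss_gb))
        time.sleep(interval)

# Example: log_rss(16829) for the worker pid mentioned in the error above.
```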

carbonox-infernox commented 5 years ago

Update: I obtained an instance with 244 GB of RAM and ran training again with 0 epochs to skip to this part:

```
steps >>> training finished
Traceback (most recent call last):
  File "main.py", line 93, in <module>
    main()
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 31, in train
    pipeline_manager.train(pipeline_name, dev_mode)
  File "/ebs/osmc/src/pipeline_manager.py", line 32, in train
    train(pipeline_name, dev_mode, self.logger, self.params, self.seed)
  File "/ebs/osmc/src/pipeline_manager.py", line 116, in train
    pipeline.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  File "/ebs/osmc/src/steps/base.py", line 106, in fit_transform
    step_inputs[input_step.name] = input_step.fit_transform(data)
  [Previous line repeated 3 more times]
  File "/ebs/osmc/src/steps/base.py", line 112, in fit_transform
    return self._cached_fit_transform(step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 123, in _cached_fit_transform
    step_output_data = self.transformer.fit_transform(**step_inputs)
  File "/ebs/osmc/src/steps/base.py", line 263, in fit_transform
    return self.transform(*args, **kwargs)
  File "/ebs/osmc/src/models.py", line 95, in transform
    outputs = self._transform(datagen, validation_datagen)
  File "/ebs/osmc/src/steps/pytorch/models.py", line 145, in _transform
    outputs = {'{}_prediction'.format(name): np.vstack(outputs_) for name, outputs_ in outputs.items()}
  File "/ebs/osmc/src/steps/pytorch/models.py", line 145, in <dictcomp>
    outputs = {'{}_prediction'.format(name): np.vstack(outputs_) for name, outputs_ in outputs.items()}
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/numpy/core/shape_base.py", line 234, in vstack
    return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
MemoryError
```
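
For context, the MemoryError comes from the np.vstack call in the last frames: the per-batch outputs collected for the whole train set are stacked into one array, so peak memory scales with the full dataset rather than a single batch. A toy sketch of that pattern (the shapes below are made up for illustration, not the pipeline's actual output sizes):

```python
import numpy as np

# Stand-in for per-batch model outputs accumulated in a list during prediction.
# Shapes are hypothetical: 10 "batches" of 20 masks of 256x256 floats.
outputs_ = [np.zeros((20, 256, 256), dtype=np.float32) for _ in range(10)]

# np.vstack concatenates everything into one new contiguous array. During the call,
# both the list of per-batch arrays and the stacked copy are alive at the same time,
# so peak RAM is roughly twice the size of all predictions for the dataset.
stacked = np.vstack(outputs_)
print(stacked.shape)  # (200, 256, 256)
```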

jakubczakon commented 5 years ago

Hi @carbonox-infernox. I think I know what is happening. After training finishes, the pipeline we currently have runs a transform over the entire train dataset. In our case we never trained to completion and simply used/copied checkpoints, so we never ran into this problem.

Your model is trained and saved at training_experiment/checkpoints/unet/best.torch. You can copy it to evaluation_experiment/transformers/unet and run evaluation/prediction in chunks.

Note that we implemented evaluation/prediction in chunks to avoid this memory issue, but we didn't do the same during training, because usually you just keep training and use your best checkpoint.
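
If it helps, a minimal sketch of that copy step in Python (the paths simply mirror the directory names above; adjust them to wherever your experiment directories actually live):

```python
import os
import shutil

# Source: the best checkpoint written during training (path as described above).
src = 'training_experiment/checkpoints/unet/best.torch'

# Destination: the evaluation experiment expects the trained transformer
# as a file named 'unet' inside its transformers directory.
dst_dir = 'evaluation_experiment/transformers'
os.makedirs(dst_dir, exist_ok=True)
shutil.copy(src, os.path.join(dst_dir, 'unet'))
```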

carbonox-infernox commented 5 years ago

@jakubczakon Thanks! That makes a lot of sense now.

Is there something I can put in the code to make it stop right after it creates that last best.torch checkpoint? What is the reason for this post-training step? Was it just for scoring your model for the competition?

Also, I've been meaning to ask about this: I've seen in 2 or 3 places in the issues that you were going to train with a sample of only 50,000 training images per epoch. Is this what you ended up doing?

How many epochs would I need to train (before the accuracy plateaus) with 50,000 images, as opposed to the number of epochs needed for the full 280,731-image dataset?