neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License

Why is the memory consumption so big during data preparation? #169

Closed sanersbug closed 6 years ago

sanersbug commented 6 years ago

My computer has 64G of memory and my training data is only 15000 images, but when I begin training it always stops at a transform. At the 'labeler transforming' step only 427456 of memory is left, 63952588 is used out of a total of 65888932, and the process stops. Is there any way to solve this problem?
Or is the only way to get a better computer? Thanks a lot!!

jakubczakon commented 6 years ago

Hmm, it seems that your model has already trained and you can run eval/predict now. That can be done in chunks: just pass -c 500 or something similar.
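
A rough, self-contained sketch of what running prediction "in chunks" means in practice: only a fixed number of images is held in memory at a time. The helper names load_image and predict_in_chunks below are placeholders, not this repo's API; the -c 500 flag mentioned above is what sets the chunk size in the actual pipeline.

def load_image(path):
    # Placeholder: the real pipeline would read and preprocess the image here.
    return path

def predict_batch(batch):
    # Placeholder: the real pipeline would run the trained network here.
    return ["mask_for_%s" % item for item in batch]

def predict_in_chunks(image_paths, chunk_size=500):
    # Process the test set chunk_size images at a time, so only one chunk
    # is ever held in memory instead of the whole dataset.
    predictions = []
    for start in range(0, len(image_paths), chunk_size):
        chunk = image_paths[start:start + chunk_size]
        batch = [load_image(p) for p in chunk]
        predictions.extend(predict_batch(batch))
    return predictions

print(len(predict_in_chunks(["img_%d.png" % i for i in range(1500)], chunk_size=500)))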

sanersbug commented 6 years ago

@jakubczakon Training has only just begun, it is not finished; the iterations have not started yet.

apyskir commented 6 years ago

@sanersbug If you got as far as labeler transforming, it means that training of your network has finished and postprocessing has just begun. The labeler step is memory-consuming; indeed, it is very heavy. To cope with this you can simply end your training pipeline after the unet step. You don't need to perform postprocessing during training, do you? To do that, add something like the following in your pipelines.py after line 27:

if train_mode:
    return unet

This way your training pipeline will finish at the unet step. Then you can run evaluation and prediction in chunks, as @jakubczakon mentioned. Hope it helps!
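
A minimal, self-contained sketch of the early-return idea described above. The builder names below (build_unet_step, build_labeler_step, build_pipeline) are placeholders, not the actual contents of pipelines.py; only the "if train_mode: return unet" pattern is the point.

def build_unet_step(config):
    # Stand-in for the network step that gets trained / used for inference.
    return {"name": "unet", "config": config}

def build_labeler_step(upstream, config):
    # Stand-in for the memory-hungry postprocessing (labeler) step.
    return {"name": "labeler", "upstream": upstream, "config": config}

def build_pipeline(config, train_mode):
    unet = build_unet_step(config)

    if train_mode:
        # Finish the pipeline at the unet step during training, so the
        # labeler's postprocessing (and its large intermediates) never runs.
        return unet

    # Only evaluation / prediction goes through postprocessing, and that
    # can be run in chunks as mentioned above.
    return build_labeler_step(unet, config)

print(build_pipeline({}, train_mode=True))   # stops at unet
print(build_pipeline({}, train_mode=False))  # includes the labeler step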

sanersbug commented 6 years ago

Thanks a lot @apyskir, I'll try it.

kamil-kaczmarek commented 6 years ago

Seems like we have covered this question. Closing this.

carbonox-infernox commented 6 years ago

@jakubczakon @apyskir @kamil-kaczmarek

I have a similar problem right now, but I have not reached labeler transforming yet. I have just gotten past "training finished", and memory usage steadily climbs until it maxes out and the process is killed. Can I skip this as described in this thread, or do I need to reach labeler transforming first?

Specifically, I get this error message: RuntimeError: DataLoader worker (pid 16829) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

Also, what effect do the number of workers and the batch size have on this? I have reduced the number of workers to 0, but that had no effect, and I have a feeling that batch size has no effect either.
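
For reference, a generic illustration of where the two knobs mentioned above live in a standard PyTorch DataLoader. This is plain PyTorch, not this repo's loader code; it only shows what num_workers=0 and a small batch_size mean.

import torch
from torch.utils.data import DataLoader, TensorDataset

# 100 fake RGB images of size 32x32 with dummy labels, just to have a dataset.
dataset = TensorDataset(torch.randn(100, 3, 32, 32), torch.zeros(100))

loader = DataLoader(
    dataset,
    batch_size=4,   # how many samples are assembled into each batch
    num_workers=0,  # 0 = load in the main process; errors then surface with a full traceback
)

for images, labels in loader:
    pass  # the training / prediction step would consume each batch here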