Closed: sanersbug closed this issue 6 years ago
Hmm, it seems that your model has already trained and you can eval/predict now. It can be done in chunks: just pass `-c 500` or something similar.
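The idea behind the `-c`/chunk flag can be sketched in plain Python. This is an illustration only; `predict` and the chunk size of 500 are placeholders, not the project's actual API:

```python
# Sketch: running prediction in fixed-size chunks so that only one
# chunk's worth of data and results is in memory at a time.

def predict(batch):
    # Placeholder for real model inference; returns one result per item.
    return [x * 2 for x in batch]

def predict_in_chunks(items, chunk_size=500):
    """Process items chunk by chunk instead of all at once."""
    results = []
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        results.extend(predict(chunk))
    return results

print(predict_in_chunks(list(range(5)), chunk_size=2))  # [0, 2, 4, 6, 8]
```

The peak memory then depends on the chunk size rather than on the size of the whole evaluation set.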
@jakubczakon The training has only just begun, it's not over; the iterations haven't started yet.
@sanersbug If you are seeing `labeler transforming`, it means that training of your network has finished and postprocessing has just begun. The `labeler` step is indeed very memory-consuming. To cope with this, you can simply end your training pipeline after the `unet` step. You don't need to perform postprocessing during training, do you?
To do it, simply add something like this in your pipelines.py after line 27:

```python
if train_mode:
    return unet
```
This way, your training pipeline will finish at the `unet` step. Then you can run evaluation and prediction in chunks, as @jakubczakon mentioned. Hope it helps!
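For context, the suggested early return fits into a pipeline-building function roughly like this. This is a minimal sketch with illustrative stand-in functions, not the actual contents of pipelines.py:

```python
# Stand-ins for the real pipeline steps (illustrative only).
def build_unet_step():
    return "unet"

def build_labeler_step(previous):
    return previous + " -> labeler"

def pipeline(train_mode):
    unet = build_unet_step()
    # The suggested change: in train_mode, stop at the unet step and
    # skip the memory-heavy labeler postprocessing entirely.
    if train_mode:
        return unet
    return build_labeler_step(unet)

print(pipeline(train_mode=True))   # "unet"
print(pipeline(train_mode=False))  # "unet -> labeler"
```

Postprocessing then only runs during evaluation/prediction, where it can be done in chunks.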
Thanks so much @apyskir, I'll try it.
Seems like we have covered this question. Closing this.
@jakubczakon @apyskir @kamil-kaczmarek
I have a similar problem right now, but I have not reached `labeler transforming` yet. I have just gotten past "training finished", and the memory usage steadily climbs until it maxes out and the process is killed. Can I skip this as described in this thread, or do I need to reach `labeler transforming` first?
Specifically, I have this error message: `RuntimeError: DataLoader worker (pid 16829) is killed by signal: Killed. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.`
Also, what effect do the number of workers and the batch size have on this? I reduced the number of workers to 0, but that had no effect, and I have a feeling that batch size has no effect either.
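As a rough mental model of why `num_workers=0` may not help here: worker processes only add to the *data loading* footprint, so if the memory is being consumed by a postprocessing step (like the `labeler`), shrinking the loader changes little. The sketch below is an illustration of that reasoning, not PyTorch's actual implementation; the function and its constants are hypothetical:

```python
# Very rough upper bound on the EXTRA memory used by multiprocess data
# loading (illustrative model, not the DataLoader implementation):
# each worker process holds its own copy of dataset state, and roughly
# one prefetched batch per worker (plus one in the main process) is in
# flight at a time.

def approx_loader_memory(dataset_bytes, num_workers, batch_bytes):
    worker_copies = max(num_workers, 1) * dataset_bytes
    in_flight_batches = (num_workers + 1) * batch_bytes
    return worker_copies + in_flight_batches

# num_workers=0 runs loading in the main process: one dataset copy,
# one batch in flight. Reducing workers/batch size only trims this
# loader overhead; it does not shrink what the model or the labeler
# postprocessing step itself allocates.
print(approx_loader_memory(10_000, 0, 100))  # 10100
print(approx_loader_memory(10_000, 4, 100))  # 40500
```

So if the climb happens after "training finished", the allocation is likely in postprocessing, and cutting it off after the `unet` step (as suggested above) is the more promising fix.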
My computer has 64 GB of memory and my training data is only 15000 images, but when I begin training, it always stops at the transform. At 'labeler transforming' only 427456 KB of memory is left (63952588 KB used out of 65888932 KB total), and then it stops. Is there any way to solve this problem,
or is the only way to get a better computer? Thanks a lot!!