Nozoomhs opened this issue 3 years ago (status: Open)
Same problem here: I am not able to run train and eval in different terminals. Is there any way model_main_tf2 can train and eval in the same process?
Hi, for anyone struggling with this: you have to limit the evaluation job to CPU only by setting CUDA_VISIBLE_DEVICES=-1 at the start. For example:

CUDA_VISIBLE_DEVICES=-1 python .../model_main_tf2.py \
  --model_dir=... \
  --pipeline_config_path=.../pipeline.config \
  --checkpoint_dir=... \
  --sample_1_of_n_eval_examples=1
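If prefixing the launch command is inconvenient (for example on Windows, where inline environment variables do not work the same way), the same effect can be had by setting the variable from Python before TensorFlow is imported. A minimal sketch; the `force_cpu` helper name is illustrative, not part of the Object Detection API:

```python
import os

def force_cpu() -> None:
    """Hide all CUDA devices from TensorFlow.

    Must run BEFORE `import tensorflow`: TF enumerates GPUs at import
    time, so setting the variable afterwards has no effect.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

force_cpu()
# import tensorflow as tf   # TF now sees no GPUs and falls back to CPU
# tf.config.list_physical_devices("GPU")  # would return an empty list
```

This mirrors what the shell prefix above does; it is only useful when you control the entry-point script.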
I am having trouble evaluating my training progress while training a TensorFlow 2 custom object detector. After reading several issues related to this problem, I found that evaluation and training should be treated as two separate processes, so I should use a new Anaconda prompt to start the evaluation job. I am training the ssd_mobilenetv2 640x640 version. I would like to monitor evaluation on TensorBoard to see whether my model is overfitting. My pipeline configuration:
I started the training with the command:
python model_main_tf2.py --model_dir=models/my_ssd2_3/ --pipeline_config_path=models/my_ssd2_3/pipeline.config --sample_1_of_n_eval_examples 1 --logtostderr
I was hoping that setting the number of evaluation examples would start the evaluation job. In any case, I tried running the evaluation in a different terminal window with:

python model_main_tf2.py --model_dir=models/my_ssd2_3 --pipeline_config_path=models/my_ssd2_3/pipeline.config --checkpoint_dir=models/my_ssd2_3/ --alsologtostderr
As soon as the evaluation starts, the training job crashes with an error. I think the problem is that I do not have sufficient hardware:

8 GB RAM, NVIDIA GTX 960M (2 GB VRAM). Could it be a problem that all the input images I use are 3000x3000, so the preprocessor has to load too much data? If so, is there any way to work around it? I would not want to resize all the images before generating the TFRecord file, because I would have to re-label all the images. I clearly lack insight into how memory is allocated at the start of the training process, so some details would be much appreciated.
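On the resizing concern: if the existing annotations are stored as pixel coordinates, they do not have to be redrawn after downscaling, because the box coordinates can be scaled by the same factor as the image. A minimal sketch, assuming square images and boxes in (xmin, ymin, xmax, ymax) order; the `scale_box` helper and the 3000-to-640 sizes are illustrative, not part of the Object Detection API:

```python
def scale_box(box, src_size, dst_size):
    """Rescale a pixel-coordinate box (xmin, ymin, xmax, ymax)
    from a src_size x src_size image to a dst_size x dst_size image."""
    factor = dst_size / src_size
    return tuple(round(coord * factor) for coord in box)

# A box labelled on the original 3000x3000 image...
box = (300, 600, 1500, 2400)
# ...maps onto a 640x640 image without re-labelling:
print(scale_box(box, 3000, 640))  # (64, 128, 320, 512)
```

The same scaling could be applied once while generating the TFRecord file, so the original annotations stay untouched.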
A second question: while monitoring the training on TensorBoard, the images are displayed with varying brightness. I tried changing line 627 of model_lib_v2.py to:

data = (features[fields.InputDataFields.image] - np.min(features[fields.InputDataFields.image])) / (np.max(features[fields.InputDataFields.image]) - np.min(features[fields.InputDataFields.image]))

according to this issue, without any luck. Is there a solution to this problem? It would also be nice if I could monitor there the bounding boxes the model proposes. Thank you.
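For reference, the min-max rescaling attempted above can be checked in isolation. A minimal NumPy sketch, independent of model_lib_v2.py (the `min_max_normalize` name is illustrative); note that inside model_lib_v2.py the image is a TF tensor, so the tf.reduce_min/tf.reduce_max equivalents would be needed rather than the NumPy calls:

```python
import numpy as np

def min_max_normalize(image: np.ndarray) -> np.ndarray:
    """Rescale an image array into [0, 1] for display."""
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(image, dtype=np.float64)
    return (image - lo) / (hi - lo)

img = np.array([[-1.0, 0.0], [1.0, 3.0]])
out = min_max_normalize(img)
print(out.min(), out.max())  # 0.0 1.0
```

If the formula checks out here but TensorBoard still shows varying brightness, the issue is likely where in the input pipeline the summary image is taken, not the normalization itself.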