kevinjohncutler / omnipose

Omnipose: a high-precision solution for morphology-independent cell segmentation
https://omnipose.readthedocs.io

Trained model does not work #73

Open urtepuod opened 5 months ago

urtepuod commented 5 months ago

Hello, I've tried to train a new model from scratch using these settings:

```
omnipose --train --use_gpu --dir "/home/urte/3D modeller/3d_cell_detector/trainingdata/Omni_5" \
    --img_filter '' --mask_filter _cp_masks \
    --pretrained_model None \
    --diameter 0 --nclasses 3 --nchan 3 --tyx 512,512 \
    --learning_rate 0.1 --RAdam --batch_size 5 --n_epochs 900 --save_every 300 --verbose
```

The training succeeds, but when I try to import the model into the GUI, I get this error:
```
2024-01-22 11:45:23,186 [INFO] TORCH GPU version installed and working.
2024-01-22 11:45:23,188 [INFO] >>>> using GPU
ERROR: Error(s) in loading state_dict for CPnet:
    size mismatch for downsample.down.res_down_0.conv.conv_0.0.weight: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.conv.conv_0.0.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.conv.conv_0.0.running_mean: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.conv.conv_0.0.running_var: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.conv.conv_0.2.weight: copying a param with shape torch.Size([32, 3, 3, 3]) from checkpoint, the shape in current model is torch.Size([32, 1, 3, 3]).
    size mismatch for downsample.down.res_down_0.proj.0.weight: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.proj.0.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.proj.0.running_mean: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.proj.0.running_var: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([1]).
    size mismatch for downsample.down.res_down_0.proj.1.weight: copying a param with shape torch.Size([32, 3, 1, 1]) from checkpoint, the shape in current model is torch.Size([32, 1, 1, 1]).
    size mismatch for output.2.weight: copying a param with shape torch.Size([4, 32, 1, 1]) from checkpoint, the shape in current model is torch.Size([3, 32, 1, 1]).
    size mismatch for output.2.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([3]).
```

I had tried training with nchan 2 and nclasses 3, but during training it automatically resets to nchan 3. My training data consists of grayscale images with masks produced by Cellpose. I apologise if this is trivial; this is all very new to me.
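For what it's worth, the size mismatches themselves record the training-time settings: the checkpoint's first conv expects 3 input channels and its output head has 4 channels, while the GUI built a model expecting 1 and 3. A minimal sketch that reads those shapes back out of the saved file (the path is hypothetical, and this assumes the file stores the raw state_dict, as the keys in the log suggest):

```python
import torch

ckpt_path = "my_omnipose_model"  # hypothetical path to the trained model file
sd = torch.load(ckpt_path, map_location="cpu")

# First conv weight has shape [32, nchan, 3, 3]: dim 1 is the input channel count.
print("nchan at training time:", sd["downsample.down.res_down_0.conv.conv_0.2.weight"].shape[1])

# Output head weight has shape [nout, 32, 1, 1]: dim 0 is the prediction channel count.
print("output channels:", sd["output.2.weight"].shape[0])
```

If those numbers disagree with what the GUI constructs, every tensor that depends on them fails to copy, which is exactly the list of mismatches above.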

kevinjohncutler commented 5 months ago

@urtepuod sorry for the delay! You may want to email me at kcutler@uw.edu to debug further; I'd like to get your model, an example image, and your pip list. I usually see this issue when cellpose has not been fully uninstalled, or when an older version of cellpose_omni and omnipose is in use. In the most recent version of the GUI, you can choose nchan and select "boundary field output" if you trained with nclasses 3. If your images are grayscale, though, the model should have been trained with no channels (but RGB grayscale could have thrown that off).
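One quick check along those lines: if nominally grayscale images were saved as 3-channel RGB, the trainer will see nchan 3. A small sketch to verify and collapse such images before training (file names are hypothetical; assumes tifffile is installed):

```python
import numpy as np
import tifffile

img = tifffile.imread("training_image.tif")  # hypothetical file name
print(img.shape)  # e.g. (512, 512, 3) means a grayscale image was saved as RGB

# If the three channels are identical copies, keep just one before training.
if img.ndim == 3 and img.shape[-1] == 3 and np.array_equal(img[..., 0], img[..., 1]):
    tifffile.imwrite("training_image_gray.tif", img[..., 0])
```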

marieanselmet commented 1 month ago

Hello, I have the same problem here. If I train an Omnipose model with nclasses = 4, I need to specify nclasses = 4 at inference when using this model; otherwise I get the same error as above, since the default is now nclasses = 2. Why was this default chosen for nclasses? And how much would it impair performance to lose two output branches when training an Omnipose model? Thanks a lot!
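For anyone hitting this from the Python side, the workaround is to pass the same nclasses (and nchan) used at training time when constructing the model. A sketch under the assumption that the cellpose_omni CellposeModel accepts these keyword arguments, as in recent versions; the model path and image file are hypothetical:

```python
import tifffile
from cellpose_omni import models

imgs = [tifffile.imread("example.tif")]  # hypothetical test image

# Match the training-time settings; here nclasses=4 and nchan=2 as an example.
model = models.CellposeModel(gpu=True,
                             pretrained_model="/path/to/trained_model",  # hypothetical
                             nclasses=4, nchan=2)
masks, flows, styles = model.eval(imgs, omni=True)
```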