kevinjohncutler / omnipose

Omnipose: a high-precision solution for morphology-independent cell segmentation
https://omnipose.readthedocs.io

Trying to fine-tune a model but it fails. I need help and guidance. #64

Open fsalfonzo opened 11 months ago

fsalfonzo commented 11 months ago

Hi, I need some advice on fine-tuning a model. I can train a model from scratch using the following CLI command: [screenshot of the training command]

If I explicitly add `--nclass 2`, it crashes.

If I try the following command, it also crashes: [screenshot of the command]

  1. Could you guide me in addressing this issue?
  2. Could you also show me how to load the custom model? (See the sketch after the error log below.)

I don't have a problem using the models already provided, so it must be something I am missing.

Error output is shown below:

    (omnipose) C:\Users\fsa>python -m omnipose --train --dir C:\Users\fsa\Desktop\bact_phase\train_sorted\5I_crop --mask_filter _masks --n_epochs 10 --pretrained_model bact_phase_omni --learning_rate 0.05 --diameter 0 --batch_size 16 --save_every 50 --RAdam
    !NEW LOGGING SETUP! To see cellpose progress, set --verbose
    No --verbose => no progress or info printed
    2023-11-02 23:41:24,034 [INFO] >>>> using CPU
    2023-11-02 23:41:24,034 [INFO] This model uses boundary field, setting nclasses=3.
    2023-11-02 23:41:24,034 [INFO] Training omni model. Setting nclasses=3, RAdam=True
    2023-11-02 23:41:24,038 [INFO] not all flows are present, will run flow generation for all images
    2023-11-02 23:41:24,042 [INFO] pretrained model C:\Users\fsa\.cellpose\models\bact_phase_omnitorch_0 is being used
    2023-11-02 23:41:24,042 [INFO] median diameter set to 0 => no rescaling during training
    2023-11-02 23:41:24,186 [INFO] No precomuting flows with Omnipose. Computed during training.
    2023-11-02 23:41:24,205 [INFO] >>> Using RAdam optimizer
    2023-11-02 23:41:24,206 [INFO] >>>> training network with 2 channel input <<<<
    2023-11-02 23:41:24,206 [INFO] >>>> LR: 0.05000, batch_size: 16, weight_decay: 0.00001
    2023-11-02 23:41:24,206 [INFO] >>>> ntrain = 5
    2023-11-02 23:41:24,206 [INFO] >>>> nimg_per_epoch = 5
    2023-11-02 23:41:24,206 [INFO] >>>> Start time: 23:41:24
    C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\omnipose\utils.py:220: RuntimeWarning: invalid value encountered in divide
      return module.clip((Y-lower_val)/(upper_val-lower_val),0,1)
    C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\omnipose\utils.py:53: RuntimeWarning: invalid value encountered in cast
      return np.uint16(rescale(im)*(2**16-1))
    2023-11-02 23:41:27,116 [INFO] Train epoch: 0 | Time: 0.05min | last epoch: 0.00s | <sec/epoch>: 0.00s | <sec/batch>: 0.84s | : 1.140086 | : 1.140086
    2023-11-02 23:41:27,117 [INFO] saving network parameters to C:\Users\fsa\Desktop\bact_phase\train_sorted\5I_crop\models/cellpose_residual_on_style_on_concatenation_off_omni_abstract_nclasses_3_nchan_2_dim_2_5I_crop_2023_11_02_23_41_24.194542
    Traceback (most recent call last):
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\omnipose\__main__.py", line 12, in <module>
        main()
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\omnipose\__main__.py", line 9, in main
        cellpose_omni_main(args)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\cellpose_omni\__main__.py", line 439, in main
        cpmodel_path = model.train(images, labels, links, train_files=image_names,
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\cellpose_omni\models.py", line 1572, in train
        model_path = self._train_net(train_data, train_labels, train_links,
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\cellpose_omni\core.py", line 1187, in _train_net
        train_loss = self._train_step(self._to_device(np.stack(imgi)),lbl)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\cellpose_omni\core.py", line 834, in _train_step
        loss = self.loss_fn(lbl,y)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\cellpose_omni\models.py", line 1396, in loss_fn
        loss = omnipose.core.loss(self, lbl, y)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\omnipose\core.py", line 2672, in loss
        return 2*(5*loss1+loss2+loss4+loss5+loss6)+self.criterion0(flow,veci) # golden?
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
        return forward_call(*args, **kwargs)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\losses\ivp_loss.py", line 105, in forward
        pred_trajectories = self._compute_batched_trajectories(vf_pred)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\losses\ivp_loss.py", line 84, in _compute_batched_trajectories
        trajectories = ivp_solver(
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\numerics\integration\ivp_int.py", line 61, in ivp_solver
        points, _ = f_solver.step(points, dx)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\numerics\integration\solvers.py", line 32, in step
        k1 = self.f(points)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\numerics\interpolation\interp_vf.py", line 32, in _vf
        out = nearest_interpolation_batched(vector_field, p)
      File "C:\Users\fsa\anaconda3\envs\omnipose\lib\site-packages\torchvf\numerics\interpolation\functional.py", line 229, in nearest_interpolation_batched
        return vf.gather(-1, points)
    RuntimeError: index -9223372036854775808 is out of bounds for dimension 3 with size 224
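
For question 2, here is a rough sketch of how loading a custom model from Python could look. This is only an illustration based on the Cellpose/Omnipose Python API, not an official recipe: the model path is hypothetical, and keyword names such as `pretrained_model`, `omni`, and `diameter`, as well as the values returned by `eval`, may differ between versions.

    # Sketch (not official): loading a custom Omnipose model for inference.
    # Assumes cellpose_omni is installed (it ships with omnipose) and that
    # MODEL_PATH points at a model file saved under <train_dir>/models/ by --train.
    import numpy as np
    from cellpose_omni import models

    MODEL_PATH = r"C:\path\to\train_dir\models\my_custom_model"  # hypothetical path

    model = models.CellposeModel(
        gpu=False,                    # set True if a CUDA GPU is available
        pretrained_model=MODEL_PATH,  # load weights from the custom model file
    )

    img = np.random.rand(256, 256)    # placeholder; use a real phase-contrast image

    # For CellposeModel, eval returns masks, flows, and styles.
    masks, flows, styles = model.eval(
        img,
        omni=True,       # use Omnipose mask reconstruction
        diameter=None,   # no rescaling, matching --diameter 0 at training time
    )

From the CLI, `--pretrained_model` should also accept the full path to that saved model file instead of a built-in name, but it is worth double-checking against the documentation.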

fsalfonzo commented 11 months ago

After spending some time diagnosing the problem, I believe I found the issue.

In Omnipose's core.py, there is this block:

    # percentile clipping augmentation
    if aug_choices[1]:
        dp = .1 # changed this from 10 to .1, as usual pipeline uses 0.01, 10 was way too high for some images
        dpct = np.random.triangular(left=0, mode=0, right=dp, size=2) # weighted toward 0
        imgi[k] = utils.normalize99(imgi[k],upper=100-dpct[0],lower=dpct[1])

This routine is applied to an image that has already been normalized. Normalizing a normalized image again can produce NaN values: when the percentile window collapses (upper_val equals lower_val), the divide in utils.normalize99 yields NaN, which appears to match the "invalid value encountered in divide" RuntimeWarning in the log above. Those NaNs then propagate through the rest of the code until it exits with the error. I hope this helps the community if someone runs into the same issue.
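
The chain from this augmentation to the RuntimeError above can be illustrated outside of Omnipose with a minimal sketch (assuming only numpy and torch; the constant array below is just a stand-in for an already-normalized, saturated patch, not Omnipose code):

    # Minimal sketch of the suspected failure mechanism (numpy + torch only).
    import numpy as np
    import torch

    # 1) Re-normalizing an already-normalized (here: constant) patch: the upper
    #    and lower percentiles coincide, so the divide yields 0/0 = NaN.
    Y = np.ones((4, 4), dtype=np.float32)
    lower_val = np.percentile(Y, 0.1)
    upper_val = np.percentile(Y, 99.9)
    with np.errstate(invalid="ignore"):
        normalized = np.clip((Y - lower_val) / (upper_val - lower_val), 0, 1)
    print(np.isnan(normalized).any())  # True

    # 2) Downstream, a NaN coordinate cast to an integer index becomes INT64_MIN
    #    on typical builds, matching the index in the RuntimeError above.
    print(torch.tensor(float("nan")).long().item())  # -9223372036854775808

So a NaN that reaches torchvf's nearest-neighbor interpolation would turn into that huge negative index, which would explain why the crash surfaces in vf.gather rather than in the normalization step itself.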

kevinjohncutler commented 9 months ago

@fsalfonzo thanks for reporting this. I haven't seen any issues with normalization, but I will check into it. Looks like you got this on the 5I_crop subset, so that is super helpful for debugging.