Training our own data through hovernet

pranathivemuri commented 3 years ago

Hi,

I was wondering whether we could train hovernet on our own data with our own instance classes, and what steps that would involve.

I saw an older issue that mentioned extract_patches as the first step, but the code seems to expect .mat files for the annotations, so I assume I would have to change that to make it work.
Could you also describe where this repo needs to be modified to train on our own data? The run_train.py command line tool doesn't seem to take any parameters or variables that one can change.

Thanks so much!

simongraham commented 3 years ago

Hi @pranathivemuri ,

You can train with your own data, as long as it is in the correct format. We stored our data as .mat files, but you don't need to. As you can see in this script, you should add your own class that defines the new dataset. In it, define load_img() and load_ann(), which determine how the images and labels you supply are loaded. You are therefore not restricted to .mat files and can modify this as you wish. You will need to use extract_patches.py no matter which dataset you use; this ensures that the data has the format and dimensions the network expects.

Please see below some steps that should hopefully help:

1. Add your own custom dataset class in dataset.py and add the new dataset option at the bottom of the script here.
2. Define the new dataset in config here.
3. Supply the correct paths to images and annotations here, and then extract patches.
4. Provide the paths of the training and validation patches in config.py here and here.
5. Initialise training with run_train.py.

Note that your data must be in the correct format before you do any of the above. In particular, you must generate an instance map per image and, if you are performing classification, a type/class map per image as well. The output of load_ann() should be the instance and type maps concatenated into an array of size HxWx2 (height x width x 2). Take a look at this line for a better understanding.
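As a rough illustration of the dataset class described above, here is a minimal sketch that reads annotations from NumPy files instead of .mat files. Only load_img() and load_ann() come from the discussion; the class name MyDataset, the with_type argument, the .npy/.npz storage layout, and the keys "inst_map" and "type_map" are assumptions made for this example, so check dataset.py for the interface the repo actually expects.

```python
import numpy as np


class MyDataset:
    """Hypothetical custom dataset; everything except the two method
    names is an illustrative assumption."""

    def load_img(self, path):
        # Assumes the image was saved as an HxWx3 array in a .npy file.
        return np.load(path)

    def load_ann(self, path, with_type=False):
        # Assumes an .npz archive holding "inst_map" (HxW) and,
        # when classification is used, "type_map" (HxW).
        data = np.load(path)
        ann_inst = data["inst_map"].astype("int32")
        if with_type:
            ann_type = data["type_map"].astype("int32")
            # Stack along a new last axis: instance map in channel 0,
            # type map in channel 1.
            return np.dstack([ann_inst, ann_type])
        # Instance-only mode: keep a trailing channel axis.
        return np.expand_dims(ann_inst, -1)
```

With classification enabled, the returned array has the instance map in its first channel and the type map in its second; without it, the array has a single channel.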

On a final note, download the CoNSeP dataset to see how the instance and type maps should look. The instance map labels nuclei instances from 1 to N, where N is the number of nuclei. The type map labels each nucleus with its class, from 1 to C, where C is the number of classes.
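Before extracting patches, it may help to check your maps against the labelling described above. The sketch below is not part of the repo: check_maps is a hypothetical helper, and it assumes that 0 marks background in both maps.

```python
import numpy as np


def check_maps(inst_map, type_map, nr_classes):
    """Return a list of problems; an empty list means the maps look consistent.

    Hypothetical helper. Assumes 0 is background in both maps.
    """
    problems = []
    if inst_map.shape != type_map.shape:
        return ["instance and type maps differ in shape"]

    # Instance ids should be exactly 1..N with no gaps.
    inst_ids = np.unique(inst_map)
    inst_ids = inst_ids[inst_ids != 0]
    if inst_ids.size and (inst_ids.min() < 1 or inst_ids.max() != inst_ids.size):
        problems.append("instance ids are not the contiguous range 1..N")

    # Every pixel that belongs to a nucleus should carry a class in 1..C.
    type_ids = np.unique(type_map[inst_map > 0])
    if type_ids.size and (type_ids.min() < 1 or type_ids.max() > nr_classes):
        problems.append("type ids inside nuclei fall outside 1..C")

    return problems


inst = np.array([[0, 1], [2, 2]])
types = np.array([[0, 1], [3, 3]])
print(check_maps(inst, types, nr_classes=4))  # []
```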

Hope this helps :)

pranathivemuri commented 3 years ago

Hi @simongraham Thanks so much for all the detailed steps, we will try these steps out.

Please let me know if I should close this issue for now or please feel free to close it!

pranathivemuri commented 3 years ago

Hi @simongraham @vqdang, I ran training following your instructions, and everything worked great until the error below. Thanks again for the instructions! Could you please help me debug it? Should prob_np and true_np have the same length, given that they are set by the model after an epoch? What if the model predicted fewer classes than the ground truth contains: would it fail as below? Please let me know what could cause this error.

The error comes from this line: https://github.com/vqdang/hover_net/blob/master/models/hovernet/run_desc.py#L283
I added print statements to see how the lengths of the two lists differ; the output is below.

prob_np length 1425 true_np length 1504

----------------EPOCH 1
Processing: |##########################1                                                       | 161/504[02:33<05:03, 1.13it/s]Batch = 9.97334|EMA = 12.85959
/code/hovernet_he/models/hovernet/targets.py:33: UserWarning: Only one label was provided to `remove_small_objects`. Did you mean to use a boolean array?
  crop_ann = morph.remove_small_objects(crop_ann, min_size=30)
Processing: |###################################################################################| 504/504[07:55<00:00, 1.06it/s]Batch = 6.11818|EMA = 6.12877
/code/hovernet_he/models/hovernet/run_desc.py:214: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  aligned_shape = np.min(np.array(aligned_shape), axis=0)[1:3]
------train-loss_tp_bce  : 0.34943
------train-loss_tp_dice : 3.44142
------train-loss_np_bce  : 0.12189
------train-loss_np_dice : 0.46985
------train-loss_hv_mse  : 0.63804
------train-loss_hv_msge : 1.10813
------train-overall_loss : 6.12877
------train-lr-net       : 0.00010
Processing: |##################################################################################################################| 90/90[00:43<00:00, 2.05it/s]
prob_np length 1425
true_np length 1504
Traceback (most recent call last):
  File "run_train.py", line 305, in <module>
    trainer.run()
  File "run_train.py", line 289, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 265, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/code/hovernet_he/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/code/hovernet_he/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/code/hovernet_he/run_utils/callbacks/base.py", line 70, in run
    chained=True, nr_epoch=self.nr_epoch, shared_state=state
  File "/code/hovernet_he/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/code/hovernet_he/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/code/hovernet_he/run_utils/callbacks/base.py", line 213, in run
    track_dict = self.proc_func(raw_data)
  File "/code/hovernet_he/models/hovernet/opt.py", line 135, in <lambda>
    lambda a: proc_valid_step_output(a, nr_types=nr_type)
  File "/code/hovernet_he/models/hovernet/run_desc.py", line 286, in proc_valid_step_output
    patch_prob_np = prob_np[idx]
IndexError: list index out of range

vqdang commented 3 years ago

@pranathivemuri This most likely happens because one of the batches has a batch size of 1. Can you check how many images you have in the test set? For example, with 91 images and a batch size of 2, the code will aggregate [90 + last image size] items instead of 91.
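The effect described above can be reproduced in isolation. The sketch below is not the repo's code: it assumes that a size-1 batch axis gets dropped somewhere (here via np.squeeze), and the helper names aggregate and trim_to_full_batches are made up for the illustration. It uses 5 tiny 4x4 images with a batch size of 2 instead of 91 real ones.

```python
import numpy as np


def aggregate(batches):
    """Collect per-image items from a list of per-batch arrays."""
    items = []
    for batch in batches:
        # extend() iterates over the first axis of each array.
        items.extend(batch)
    return items


def trim_to_full_batches(nr_images, batch_size):
    """Largest image count that leaves no partial batch."""
    return nr_images - nr_images % batch_size


h, w = 4, 4                     # tiny "image" size for the demo
nr_images, batch_size = 5, 2    # odd count, so the last batch holds 1 image
images = np.zeros((nr_images, h, w))
batches = [images[i:i + batch_size] for i in range(0, nr_images, batch_size)]

# np.squeeze removes every axis of length 1, including a batch axis of size 1.
squeezed = [np.squeeze(b) for b in batches]

print(len(aggregate(batches)))    # 5: one item per image
print(len(aggregate(squeezed)))   # 8: 4 images + the 4 rows of the last image
print(trim_to_full_batches(nr_images, batch_size))  # 4
```

Trimming the validation set to a multiple of the batch size is the conservative workaround; strictly, only a final batch of exactly one image triggers the mismatch in this sketch.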

pranathivemuri commented 3 years ago

Hi @vqdang! Everything is working fine now that I use an even number of test images.

0:00, 1.07it/s]Batch = 5.19666|EMA = 5.12571
------train-loss_tp_bce  : 0.29149
------train-loss_tp_dice : 3.37886
------train-loss_np_bce  : 0.10948
------train-loss_np_dice : 0.44088
------train-loss_hv_mse  : 0.18750
------train-loss_hv_msge : 0.71750
------train-overall_loss : 5.12571
------train-lr-net       : 0.00010
Processing: |###################################| 119/119[00:55<00:00, 2.13it/s]
1900
1900
------valid-np_acc    : 0.94497
------valid-np_dice   : 0.71361
------valid-tp_dice_0 : 0.96658
------valid-tp_dice_1 : 0.00827
------valid-tp_dice_2 : 0.50580
------valid-tp_dice_3 : 0.20441
------valid-tp_dice_4 : 0.37238
------valid-hv_mse    : 0.19181
----------------EPOCH 3
Processing: |####1 | 325/475[05:03<02:18, 1.08it/s]Batch = 5.38383|EMA = 4.96389/code/hovernet_he/models/hovernet/targets.py:33: UserWarning: Only one label was provided to `remove_small_objects`. Did you mean to use a boolean array?
  crop_ann = morph.remove_small_objects(crop_ann, min_size=30)
Processing: |####6 | 372/475[05:47<01:37, 1.06it/s]Batch = 4.55204|EMA = 4.90317

pranathivemuri commented 3 years ago

Thanks!

pranathivemuri commented 3 years ago

@simongraham @vqdang Sorry to comment on a closed issue. This is less an issue than a question about what the logs directory contains. It has two subdirectories, 00 and 01, and I am not sure what they mean. When I use the tar file for the 50th epoch from the 00 directory, all 4 of my input classes show up in the annotations, but when I use the checkpoint tar from 01, only a single class, the highest one, shows up. Could you please explain what the 00 and 01 directories are? I looked through my history to see whether I had run training twice successfully, but I don't think I did.

Could it be that 00 and 01 are the multi-class segmentation checkpoint and the binary checkpoint, respectively?

I also tried to trace where the logs directory is and it seems like it is coming from the phase_list in config.py - https://github.com/vqdang/hover_net/blob/be8ae2d621bfbddefd97591ef9df39252e108df9/models/hovernet/opt.py#L28

Please help, thanks so much!

vqdang commented 3 years ago

There are 2 training phases, as detailed in the paper, corresponding to the 00 and 01 directories you see. In phase 1 (00) we train only the decoder portions; in phase 2 (01) we load the last checkpoint of phase 1 and train the entire model. Both phases use the same segmentation mode (instance or instance+typing). Technically you can remove directory 00, because we only use the checkpoint from 01.
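If you want to pick up the final checkpoint programmatically, something like the sketch below could work. It is not part of the repo: latest_checkpoint is a hypothetical helper, and it assumes that phase directories have purely numeric names (00, 01) and that each checkpoint is a .tar file whose name ends in its epoch number.

```python
import re
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the highest-epoch .tar file from the last phase directory.

    Hypothetical helper; returns None if nothing matching is found.
    """
    # Phase directories are assumed to be named with digits only.
    phase_dirs = [p for p in Path(log_dir).iterdir()
                  if p.is_dir() and p.name.isdigit()]
    if not phase_dirs:
        return None
    last_phase = max(phase_dirs, key=lambda p: int(p.name))

    def epoch_of(path):
        # Use the last run of digits in the file name as the epoch.
        digits = re.findall(r"\d+", path.stem)
        return int(digits[-1]) if digits else -1

    checkpoints = list(last_phase.glob("*.tar"))
    return max(checkpoints, key=epoch_of) if checkpoints else None
```

Comparing epochs as integers rather than as strings matters here: a plain string sort would rank epoch 9 above epoch 10.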

vqdang / hover_net

Training our own data through hovernet #103