pranathivemuri closed this issue 3 years ago.
Hi @pranathivemuri,
You can train with your own data, as long as you ensure that your data is in the correct format. We stored our data as .mat files, but this doesn't mean that you need to. As you can see in this script, you should add your own class that defines the new dataset you intend to add. In it, you should define load_img() and load_ann(), which determine how the images and labels that you supply are loaded. Therefore, you are not restricted to .mat files here and can modify this as you wish. You will need to use extract_patches.py no matter which dataset you use; this ensures that the data has the format and dimensions expected by the network.
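As a rough sketch of such a class, assume images are stored as HxWx3 .npy arrays and each label file is a .npz holding inst_map and type_map arrays. The class name, file layout and key names are hypothetical, not the repository's own. Under those assumptions, load_img() and load_ann() could look like this:

```python
import numpy as np


class MyNpyDataset:
    """Hypothetical dataset class: images as .npy, labels as .npz (no .mat)."""

    def load_img(self, path):
        # Expect an RGB image saved as an HxWx3 uint8 array.
        img = np.load(path)
        assert img.ndim == 3 and img.shape[-1] == 3, "image must be HxWx3"
        return img.astype("uint8")

    def load_ann(self, path, with_type=False):
        # Assumed .npz keys: "inst_map" (HxW, 0 = background, 1..N = nuclei)
        # and, for classification, "type_map" (HxW, 0 = background, 1..C).
        with np.load(path) as data:
            ann_inst = data["inst_map"]
            if with_type:
                ann_type = data["type_map"]
                ann = np.dstack([ann_inst, ann_type])  # HxWx2
            else:
                ann = np.expand_dims(ann_inst, -1)  # HxWx1
        return ann.astype("int32")
```

Swapping np.load for whatever reader matches your storage format is the only change needed. What matters is the shape and dtype of what the two methods return.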
Please see below some steps that should hopefully help:
1. dataset.py: add your new dataset class and add the new dataset option at the bottom of the script here.
2. config.py: edit the settings here and here.
3. run_train.py: run this script to start training.
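For step 1, here is a minimal sketch of the kind of name-to-class lookup the bottom of dataset.py could hold. It assumes a get_dataset(name) function keyed by a lowercase dataset name. The "my_dataset" entry and the MyDataset class are hypothetical; check the actual script for its exact form.

```python
class MyDataset:
    """Stand-in for your own class that defines load_img() and load_ann()."""


def get_dataset(name):
    # Map the dataset name chosen in config.py to a constructor.
    name_dict = {
        "my_dataset": lambda: MyDataset(),  # hypothetical new entry
    }
    key = name.lower()
    if key not in name_dict:
        raise ValueError("Unknown dataset `%s`" % name)
    return name_dict[key]()
```

The same name string would then be what you select in config.py.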
Note that you must ensure your data is in the correct format, and this must be done before any of the above. In particular, you must generate an instance map and, if you are performing classification, a type/class map per image. When using load_ann(), ensure that the output is the concatenated instance and type map of size HxWx2. Take a look at this line for a better understanding.
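Suppose your annotation tool exports one binary mask plus one class label per nucleus (an assumed input format; adapt it to what you actually have). The two maps and the stacked HxWx2 array could then be built roughly like this. build_maps is a hypothetical helper:

```python
import numpy as np


def build_maps(nuclei, height, width):
    """nuclei: list of (mask, class_id) pairs, where mask is an HxW bool array
    and class_id is in 1..C. Returns an HxWx2 int32 array [inst_map, type_map]."""
    inst_map = np.zeros((height, width), dtype=np.int32)
    type_map = np.zeros((height, width), dtype=np.int32)
    for inst_id, (mask, class_id) in enumerate(nuclei, start=1):
        # Later nuclei overwrite earlier ones where masks overlap.
        inst_map[mask] = inst_id  # instances labelled 1..N
        type_map[mask] = class_id  # classes labelled 1..C, 0 = background
    return np.dstack([inst_map, type_map])
```

If your masks can overlap, decide explicitly which nucleus should win. This sketch simply lets the later one overwrite the earlier one.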
On a final note, download the CoNSeP dataset to see how the instance and type maps should look. The instance map labels nuclei instances from 1 to N, where N is the number of nuclei, with 0 as background. The type map labels the nuclei with their class, from 1 to C, where C is the number of classes.
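Before running extract_patches.py, a quick sanity check of your own maps against these conventions can save a failed run. The sketch below is hypothetical. It checks three things: instance ids run from 1 to N, type values lie in 0..C, and each instance carries a single class. The last check is an assumption that follows from the type map labelling whole nuclei.

```python
import numpy as np


def check_maps(inst_map, type_map, nr_classes):
    """Return a list of problems found in an (inst_map, type_map) pair."""
    problems = []
    ids = np.unique(inst_map)
    ids = ids[ids != 0]  # 0 is background
    if not np.array_equal(ids, np.arange(1, len(ids) + 1)):
        problems.append("instance ids are not 1..N")
    if type_map.min() < 0 or type_map.max() > nr_classes:
        problems.append("type values outside 0..C")
    for inst_id in ids:
        # Every pixel of one nucleus should carry the same class.
        classes = np.unique(type_map[inst_map == inst_id])
        if len(classes) != 1:
            problems.append("instance %d has %d classes" % (inst_id, len(classes)))
    return problems
```

An empty list means the pair passed these checks. It does not prove the annotations themselves are right.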
Hope this helps :)
Hi @simongraham, thanks so much for all the detailed steps; we will try them out.
Please let me know if I should close this issue for now, or feel free to close it!
Hi @simongraham @vqdang, I have tried to run training using your instructions, and everything worked great until the error below. Thanks again for the instructions! Could you please help me debug it? Should prob_np and true_np be the same length, as set by the model after an epoch? What if the model has predicted fewer classes than were in the ground truth; would it error out as below? Could you let me know what would cause this error?
The error is coming from this line: https://github.com/vqdang/hover_net/blob/master/models/hovernet/run_desc.py#L283
I added print statements to see how different the lists were, and below is what I have:
prob_np length 1425 true_np length 1504
----------------EPOCH 1
Processing: |##########################1 | 161/504[02:33<05:03, 1.13it/s]Batch = 9.97334|EMA = 12.85959
/code/hovernet_he/models/hovernet/targets.py:33: UserWarning: Only one label was provided to `remove_small_objects`. Did you mean to use a boolean array?
crop_ann = morph.remove_small_objects(crop_ann, min_size=30)
Processing: |###################################################################################| 504/504[07:55<00:00, 1.06it/s]Batch = 6.11818|EMA = 6.12877
/code/hovernet_he/models/hovernet/run_desc.py:214: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
aligned_shape = np.min(np.array(aligned_shape), axis=0)[1:3]
------train-loss_tp_bce : 0.34943
------train-loss_tp_dice : 3.44142
------train-loss_np_bce : 0.12189
------train-loss_np_dice : 0.46985
------train-loss_hv_mse : 0.63804
------train-loss_hv_msge : 1.10813
------train-overall_loss : 6.12877
------train-lr-net : 0.00010
Processing: |##################################################################################################################| 90/90[00:43<00:00, 2.05it/s]
prob_np length 1425
true_np length 1504
Traceback (most recent call last):
  File "run_train.py", line 305, in <module>
    trainer.run()
  File "run_train.py", line 289, in run
    phase_info, engine_opt, save_path, prev_log_dir=prev_save_path
  File "run_train.py", line 265, in run_once
    main_runner.run(opt["nr_epochs"])
  File "/code/hovernet_he/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/code/hovernet_he/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/code/hovernet_he/run_utils/callbacks/base.py", line 70, in run
    chained=True, nr_epoch=self.nr_epoch, shared_state=state
  File "/code/hovernet_he/run_utils/engine.py", line 197, in run
    self.__trigger_events(Events.EPOCH_COMPLETED)
  File "/code/hovernet_he/run_utils/engine.py", line 123, in __trigger_events
    callback.run(self.state, event)
  File "/code/hovernet_he/run_utils/callbacks/base.py", line 213, in run
    track_dict = self.proc_func(raw_data)
  File "/code/hovernet_he/models/hovernet/opt.py", line 135, in <lambda>
    lambda a: proc_valid_step_output(a, nr_types=nr_type)
  File "/code/hovernet_he/models/hovernet/run_desc.py", line 286, in proc_valid_step_output
    patch_prob_np = prob_np[idx]
IndexError: list index out of range
@pranathivemuri This most likely happens because one of the batches has a batch size of 1. Can you check how many images you have in the test set? For example, 91 images with a batch size of 2 will make the code aggregate [90 + last image size] entries instead of 91.
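Here is a minimal NumPy sketch of this failure mode. It assumes the validation step squeezes its arrays without specifying an axis, and that outputs are accumulated by extending a Python list with each batch. The helper names are hypothetical, not taken from the repository. Under those assumptions, a final batch of one image loses its batch axis, so extending the list adds one entry per image row instead of one entry per image:

```python
import numpy as np


def make_batches(n_images, batch_size, h=4, w=4):
    """Yield dummy label batches of shape (B, H, W, 1); the last one may be smaller."""
    labels = np.zeros((n_images, h, w, 1), dtype=np.int64)
    for start in range(0, n_images, batch_size):
        yield labels[start:start + batch_size]


def accumulate(batches, squeeze_all_axes):
    """Mimic a callback that extends one list with per-image outputs."""
    accumulated = []
    for batch in batches:
        if squeeze_all_axes:
            step = np.squeeze(batch)  # (1, H, W, 1) -> (H, W): batch axis is lost
        else:
            step = np.squeeze(batch, axis=-1)  # drop only the channel: (B, H, W)
        accumulated.extend(list(step))  # iterates over the first axis
    return accumulated


buggy = accumulate(make_batches(5, 2), squeeze_all_axes=True)
fixed = accumulate(make_batches(5, 2), squeeze_all_axes=False)
print(len(buggy), len(fixed))  # 8 5
```

With 5 images and a batch size of 2, the last batch holds one 4-row image, so the buggy path collects 2 + 2 + 4 = 8 entries. Squeezing only the channel axis keeps the count at 5.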
Hi @vqdang! Everything is working fine for now, after I switched to an even number of test images:
0:00, 1.07it/s]Batch = 5.19666|EMA = 5.12571
------train-loss_tp_bce : 0.29149
------train-loss_tp_dice : 3.37886
------train-loss_np_bce : 0.10948
------train-loss_np_dice : 0.44088
------train-loss_hv_mse : 0.18750
------train-loss_hv_msge : 0.71750
------train-overall_loss : 5.12571
------train-lr-net : 0.00010
Processing: |###################################| 119/119[00:55<00:00, 2.13it/s]
1900
1900
------valid-np_acc : 0.94497
------valid-np_dice : 0.71361
------valid-tp_dice_0 : 0.96658
------valid-tp_dice_1 : 0.00827
------valid-tp_dice_2 : 0.50580
------valid-tp_dice_3 : 0.20441
------valid-tp_dice_4 : 0.37238
------valid-hv_mse : 0.19181
----------------EPOCH 3
Processing: |####1 | 325/475[05:03<02:18, 1.08it/s]Batch = 5.38383|EMA = 4.96389
/code/hovernet_he/models/hovernet/targets.py:33: UserWarning: Only one label was provided to `remove_small_objects`. Did you mean to use a boolean array?
crop_ann = morph.remove_small_objects(crop_ann, min_size=30)
Processing: |####6 | 372/475[05:47<01:37, 1.06it/s]Batch = 4.55204|EMA = 4.90317
Thanks!
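Assume the error comes from a final batch that holds a single image, as explained above. In that case an even number of test images only avoids it when the batch size is 2. In general, the problem appears whenever the number of images modulo the batch size equals 1, and with a batch size of 1 every batch is affected. Here is a small hypothetical check:

```python
def has_singleton_batch(n_images, batch_size):
    """True if some validation batch would contain exactly one image."""
    if n_images <= 0:
        return False
    if batch_size == 1:
        return True  # every batch holds a single image
    return n_images % batch_size == 1  # only the last batch can be short
```

For example, 91 images with a batch size of 2 trigger it, while 91 images with a batch size of 4 do not (the last batch holds 3). If you edit the data loader yourself, dropping the incomplete last batch (e.g. drop_last=True on a PyTorch DataLoader) also avoids it, at the cost of skipping those images during validation.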
@simongraham @vqdang Sorry to comment on a closed issue. My question is not really an issue, but more about what the logs directory means. The logs directory contains 00 and 01, and I am not sure what they mean. When I use the tar file for the 50th epoch from the 00 directory, I can see all 4 classes I annotated in the input, but when I use the checkpoint tar from 01, only one class (the highest) shows up. Could you please explain what the 00 and 01 directories are? I looked through my history to see whether I ran training twice successfully, but I don't think I did.
Could it be that 00 and 01 are the multi-class segmentation checkpoint and the binary checkpoint, respectively?
I also tried to trace where the logs directory comes from, and it seems to come from phase_list in opt.py: https://github.com/vqdang/hover_net/blob/be8ae2d621bfbddefd97591ef9df39252e108df9/models/hovernet/opt.py#L28
Please help, thanks so much!
There are 2 training phases, as detailed in the paper, corresponding to the 00 and 01 directories you see. In phase 1 (00) we train only the decoder portions; in phase 2 (01) we load the last checkpoint of phase 1 and train the entire model. Both phases use the same segmentation mode (instance or instance+typing). Technically, you can remove directory 00 because we only use the checkpoint from 01.
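To pick the checkpoint to use, here is a small sketch that returns the highest-epoch checkpoint from the last phase directory. It assumes phase directories are named 00, 01, ... and that checkpoints are named like net_epoch=50.tar. The naming pattern is an assumption, so adjust the regular expression to the files you actually see.

```python
import os
import re

CKPT_PATTERN = re.compile(r"^net_epoch=(\d+)\.tar$")  # assumed naming scheme


def latest_checkpoint(log_dir):
    """Return the path of the highest-epoch checkpoint in the last phase dir."""
    phases = sorted(
        (d for d in os.listdir(log_dir)
         if d.isdigit() and os.path.isdir(os.path.join(log_dir, d))),
        key=int,
    )
    if not phases:
        raise FileNotFoundError("no phase directories in %s" % log_dir)
    last_phase = os.path.join(log_dir, phases[-1])  # e.g. logs/01 (phase 2)
    epochs = []
    for name in os.listdir(last_phase):
        match = CKPT_PATTERN.match(name)
        if match:
            epochs.append((int(match.group(1)), name))
    if not epochs:
        raise FileNotFoundError("no checkpoints in %s" % last_phase)
    return os.path.join(last_phase, max(epochs)[1])
```

Sorting the epochs numerically rather than as strings matters once epoch numbers have different digit counts.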
Hi,
I was wondering whether we could train HoVer-Net on our own data, with our own instance classes, and what steps are needed to get there.
Thanks so much!