ljwztc / CLIP-Driven-Universal-Model

[ICCV 2023] CLIP-Driven Universal Model; Rank first in MSD Competition.

training on parts of the datasets #52

Closed sharonlee12 closed 4 months ago

sharonlee12 commented 9 months ago

Hello, I want to train on a subset of the datasets in PAOT.txt, such as 01 Multi-Atlas Labeling, 02 TCIA Pancreas-CT, and 03 CHAOS. How should I modify label_transfer.py? Should TRANSFER_LIST be `['01', '02', '03']`? Do I also need to modify num_classes, and if so, what should I change it to? I tried to train on the BTCV dataset, but my Dice loss and BCE loss did not decrease; my label_transfer.py is set up as shown in the attached settings.

I hope to receive your guidance, thank you!

sharonlee12 commented 9 months ago

I use `with torch.no_grad()` in training; I do this because of a "CUDA out of memory" error. Does it matter?

ljwztc commented 9 months ago
  1. Yes, you can run the label_transfer step with only 01, 02, and 03.
  2. It matters: training requires gradients, so `torch.no_grad()` must be removed during training.
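To see why point 2 matters, here is a minimal sketch (with a hypothetical toy model, not the repository's network) of what `torch.no_grad()` does to training: inside the context no computation graph is recorded, so the loss cannot backpropagate and the weights never update, which makes the losses appear stuck.

```python
import torch

# Hypothetical toy stand-in for the segmentation model.
model = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

# Inside no_grad, no graph is recorded: the output cannot be
# backpropagated, so weights would never update.
with torch.no_grad():
    out_frozen = model(x)

# Normal training forward pass: the graph is recorded and
# .backward() populates gradients on the parameters.
out = model(x)
loss = out.sum()
loss.backward()
```

If you call `.backward()` on a tensor produced under `no_grad`, PyTorch raises an error, since there is no graph to traverse; that is why the context must be removed from the training loop (it is fine for validation).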
sharonlee12 commented 9 months ago
> 1. Yes, you can run the label_transfer step with only 01, 02, and 03.
> 2. It matters: training requires gradients, so `torch.no_grad()` must be removed during training.

Hello, thanks for your reply! I have resolved the CUDA out-of-memory issue by using `with torch.cuda.amp.autocast()`. However, when training on BTCV, the Dice loss oscillates continuously without decreasing, while the CE loss does decrease. When I use my checkpoint to validate, the result is as follows:

```
Spleen: dice 0.0000, recall 0.0000, precision nan
Right Kidney: dice 0.0000, recall 0.0000, precision nan
Left Kidney: dice 0.0000, recall 0.0000, precision nan
Esophagus: dice 0.0000, recall 0.0000, precision nan
Liver: dice 0.0000, recall 0.0000, precision nan
Stomach: dice 0.0000, recall 0.0000, precision nan
Aorta: dice 0.0000, recall 0.0000, precision nan
Postcava: dice 0.0000, recall 0.0000, precision nan
Portal Vein and Splenic Vein: dice 0.0000, recall 0.0000, precision nan
Pancreas: dice 0.0000, recall 0.0000, precision nan
Right Adrenal Gland: dice 0.0000, recall 0.0000, precision nan
Left Adrenal Gland: dice 0.0000, recall 0.0000, precision nan
case01_Multi-Atlas_Labeling/label/label0035 | Spleen: 0.0000, Right Kidney: 0.0000, Left Kidney: 0.0000, Esophagus: 0.0000, Liver: 0.0000, Stomach: 0.0000, Aorta: 0.0000, Postcava: 0.0000, Portal Vein and Splenic Vein: 0.0000, Pancreas: 0.0000, Right Adrenal Gland: 0.0000, Left Adrenal Gland: 0.0000
```

Have you ever encountered a similar problem? I hope to receive your reply, thank you! Here is my training code:

```python
def train(args, train_loader, model, optimizer, loss_seg_DICE, loss_seg_CE):
    model.train()
    loss_bce_ave = 0
    loss_dice_ave = 0
    epoch_iterator = tqdm(
        train_loader, desc="Training (X / X Steps) (loss=X.X)", dynamic_ncols=True
    )
    for step, batch in enumerate(epoch_iterator):
        x, y, name = batch["image"].to(args.device), batch["post_label"].float().to(args.device), batch['name']
        torch.cuda.empty_cache()
        with torch.cuda.amp.autocast():
            logit_map = model(x)
        torch.cuda.empty_cache()

        term_seg_Dice = loss_seg_DICE.forward(logit_map, y, name, TEMPLATE)
        term_seg_BCE = loss_seg_CE.forward(logit_map, y, name, TEMPLATE)
        loss = term_seg_BCE + term_seg_Dice
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_iterator.set_description(
            "Epoch=%d: Training (%d / %d Steps) (dice_loss=%2.5f, bce_loss=%2.5f)" % (
                args.epoch, step, len(train_loader), term_seg_Dice.item(), term_seg_BCE.item())
        )
        loss_bce_ave += term_seg_BCE.item()
        loss_dice_ave += term_seg_Dice.item()
        torch.cuda.empty_cache()
    print('Epoch=%d: ave_dice_loss=%2.5f, ave_bce_loss=%2.5f' % (args.epoch, loss_dice_ave/len(epoch_iterator), loss_bce_ave/len(epoch_iterator)))

    return loss_dice_ave/len(epoch_iterator), loss_bce_ave/len(epoch_iterator)
```
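One thing worth noting about the loop above: it runs the forward pass under `torch.cuda.amp.autocast()` but calls `loss.backward()` without a `torch.cuda.amp.GradScaler`. In fp16, small gradients can underflow to zero without loss scaling, which is one possible cause of a loss that oscillates instead of decreasing. A minimal sketch of the standard AMP pattern (toy model and data, not the repository's code; the `enabled` flags make it a no-op on CPU) looks like this:

```python
import torch

# Hypothetical toy model/optimizer standing in for the real ones.
model = torch.nn.Linear(8, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op when disabled

for step in range(2):
    x = torch.randn(4, 8)
    y = torch.randn(4, 3)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_cuda):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)         # unscales gradients, then optimizer.step()
    scaler.update()                # adjusts the scale factor for the next step
```

This is a sketch under the assumption that AMP underflow is a contributing factor; it does not by itself explain the all-zero validation Dice scores.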
sharonlee12 commented 9 months ago

Here is a picture of the training loss (images attached).

Adoreeeeee commented 9 months ago

> Hello, I want to train on a subset of the datasets in PAOT.txt, such as 01 Multi-Atlas Labeling, 02 TCIA Pancreas-CT, and 03 CHAOS. How should I modify label_transfer.py? [...]

Hello! I'm having a similar issue. I used a new backbone and a new dataset. I tried everything I could, but during training the Dice loss stays at 1 while the BCE loss drops.

ljwztc commented 9 months ago

I haven't encountered this before. Usually, the BCE loss decreases to less than 0.01 and the Dice loss decreases to roughly 0.7.

sharonlee12 commented 9 months ago

> I haven't encountered this before. Usually, the BCE loss decreases to less than 0.01 and the Dice loss decreases to roughly 0.7.

Hello, may I ask whether you trained on all of PAOT.txt, that is, on all the datasets? I wonder if my problem is due to the small dataset size, since I only train on BTCV. Thank you!

sharonlee12 commented 6 months ago

> > I haven't encountered this before. Usually, the BCE loss decreases to less than 0.01 and the Dice loss decreases to roughly 0.7.
>
> Hello, may I ask whether you trained on all of PAOT.txt, that is, on all the datasets? I wonder if my problem is due to the small dataset size, since I only train on BTCV. Thank you!

No, I just train on BTCV, because the full PAOT.txt dataset is very large.

ljwztc commented 4 months ago
  1. As for the Dice loss: the bug in the Dice loss calculation has been addressed at this link, so we can now observe the expected decrease in the Dice loss.
  2. As for convergence: the convergence of this model may require a substantial dataset, although dataset size is not necessarily the sole contributing factor.
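For readers hitting the same symptom, a generic multi-label soft Dice loss of the kind discussed in this thread can be sketched as follows. This is a simplified illustration, not the repository's exact implementation: the real `loss_seg_DICE` also takes `name` and `TEMPLATE` arguments to mask out organs unlabeled in each dataset.

```python
import torch

def dice_loss(logits, target, smooth=1e-5):
    """Soft Dice loss for multi-label segmentation.

    logits: (B, C, *spatial) raw scores; target: (B, C, *spatial) binary masks.
    Each channel is treated as an independent binary problem (sigmoid, not
    softmax), matching the multi-label setup used for partially labeled data.
    """
    probs = torch.sigmoid(logits)
    dims = tuple(range(2, probs.dim()))            # reduce over spatial dims
    inter = (probs * target).sum(dims)
    denom = probs.sum(dims) + target.sum(dims)
    dice = (2 * inter + smooth) / (denom + smooth)  # smooth keeps empty masks stable
    return 1 - dice.mean()
```

A quick sanity check: feeding logits that are strongly positive exactly where the target is 1 should drive this loss close to 0, while inverted logits should drive it close to 1. If a Dice implementation fails such a check, that points to the kind of calculation bug mentioned above.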