mykcs / MedSAM_lite-0424

Apache License 2.0

Expected 10 epochs to run, but only 1 epoch actually runs #3

Open mykcs opened 2 months ago

mykcs commented 2 months ago

train_mps-4.py: start_epoch = 0, num_epochs = 10. Verified that the problem is not caused here.

mykcs commented 2 months ago

gemini

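There are several situations that could cause this code to run only one epoch instead of the expected 10. Here are some possible causes and ways to check them:

1. Dataset size and train_loader

2. A hidden break statement

3. Exceptions or errors

4. Progress-bar-related problems

Troubleshooting steps

  1. Print information for every step and every epoch: before the for step, batch in enumerate(pbar) loop starts, and after the for epoch in range(start_epoch + 1, num_epochs + 1) loop ends, print the values of epoch and step.
  2. Debugger: step through the code with a debugger to track how variables change and whether the loop exits in an unexpected place.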
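
The first suggestion could be sketched roughly as follows. This is a hypothetical stand-in for the training loop, not code from train_mps-4.py: the names run_training, train_one_step and steps_per_epoch are assumptions made for illustration. It prints progress at the points described above and uses try/finally so that the last epoch reached is reported even if an exception ends the loop early.

```python
def run_training(start_epoch, num_epochs, steps_per_epoch, train_one_step):
    """Hypothetical loop skeleton that reports how far training got."""
    completed = []            # epochs that finished all of their steps
    epoch, step = None, None
    try:
        for epoch in range(start_epoch + 1, num_epochs + 1):
            print(f"[debug] starting epoch {epoch}/{num_epochs}")
            for step in range(steps_per_epoch):
                train_one_step(epoch, step)
            completed.append(epoch)
            print(f"[debug] finished epoch {epoch}, last step {step}")
    finally:
        # Runs on normal exit and also while an exception is propagating.
        print(f"[debug] loop exited at epoch={epoch}, step={step}, "
              f"completed={completed}")
    return completed


if __name__ == "__main__":
    # A no-op step function stands in for the real forward/backward pass.
    print(run_training(0, 3, 2, lambda epoch, step: None))
```

If the final debug line shows fewer completed epochs than num_epochs, the loop was left early, and any traceback printed after it points at the cause.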

If you can provide more details on the following questions, I can give you a more precise diagnosis

mykcs commented 2 months ago

It is probably related to these:

parser.add_argument(
    "--sanity_check", action="store_true",
    help="Whether to do sanity check for dataloading."
)
if do_sancheck:
    tr_dataset = NpyDataset(data_root, data_aug=True)
    tr_dataloader = DataLoader(tr_dataset, batch_size=8, shuffle=True)
    for step, batch in enumerate(tr_dataloader):
        # show the example
        _, axs = plt.subplots(1, 2, figsize=(10, 10))
        idx = random.randint(0, 4)

        image = batch["image"]
        gt = batch["gt2D"]
        bboxes = batch["bboxes"]
        names_temp = batch["image_name"]

        axs[0].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[0])
        show_box(bboxes[idx].numpy().squeeze(), axs[0])
        axs[0].axis('off')
        # set title
        axs[0].set_title(names_temp[idx])
        idx = random.randint(4, 7)
        axs[1].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[1])
        show_box(bboxes[idx].numpy().squeeze(), axs[1])
        axs[1].axis('off')
        # set title
        axs[1].set_title(names_temp[idx])
        plt.subplots_adjust(wspace=0.01, hspace=0)
        plt.savefig(
            join(work_dir, 'medsam_lite-train_bbox_prompt_sanitycheck_DA.png'),
            bbox_inches='tight',
            dpi=300
        )
        plt.close()
        break

mykcs commented 2 months ago

train_mps-5.py: new change

parser.add_argument(
    "--sanity_check", action="store_true", default=False,
    help="Whether to do sanity check for dataloading."
)

mykcs commented 2 months ago

Still only 1 epoch. Log: https://wandb.ai/team-mykcs/MedSAM-lite/runs/388c0cs6/logs
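
For context, argparse's store_true action already defaults to False when the flag is absent, so adding default=False would not be expected to change behavior. A minimal check, independent of the training script (the parser here is rebuilt from scratch for illustration only):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--sanity_check", action="store_true",
    help="Whether to do sanity check for dataloading."
)

# Parse explicit argument lists instead of sys.argv so the check is reproducible.
without_flag = parser.parse_args([])
with_flag = parser.parse_args(["--sanity_check"])

print(without_flag.sanity_check)  # False
print(with_flag.sanity_check)     # True
```

Note also that the break at the end of the sanity-check snippet only leaves the for step, batch in enumerate(tr_dataloader) loop inside the if do_sancheck: block; on its own it does not exit a separate epoch loop.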

The error "UnboundLocalError: local variable 'best_loss' referenced before assignment" means that you're trying to use the variable best_loss before it has been given a value within the scope where you're using it. Here's how to fix this and understand why it's happening:

Understanding the Error

Solution

  1. Initialize best_loss: Before the start of the training loop (likely before start_epoch = 0), initialize the best_loss variable with a very large number to ensure the initial comparison will be true:

    best_loss = float('inf')  # Or a large starting value
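
A small self-contained sketch of why this happens and how the initialization fixes it. The function names and loss values below are made up for illustration and are not taken from the training script:

```python
def track_best_broken(epoch_losses):
    """Reproduces the error: best_loss is only assigned inside the branch."""
    for loss in epoch_losses:
        if loss < best_loss:      # read before any assignment in this function
            best_loss = loss
    return best_loss


def track_best_fixed(epoch_losses):
    """Same logic, with best_loss initialized before the loop."""
    best_loss = float("inf")
    for loss in epoch_losses:
        if loss < best_loss:
            best_loss = loss
    return best_loss


try:
    track_best_broken([0.5, 0.3])
except UnboundLocalError as err:
    print("broken version raised:", type(err).__name__)

print("fixed version returned:", track_best_fixed([0.5, 0.3]))
```

Because best_loss is assigned somewhere inside the function, Python treats it as a local name for the whole function body, so a module-level best_loss is not visible there unless it is passed in or declared global.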

Additional Considerations

Let me know if you'd like help pinpointing the exact location in your code to make the fix!

mykcs commented 2 months ago

train_mps-6.py

Moved more of the code that was outside any def into def main.

Added print(f"if checkpoint else here")

parser.add_argument(
    # "-work_dir", type=str, default="./workdir",
    "-work_dir", type=str, default="./work_dir",
    help="Path to the working directory where checkpoints and logs will be saved."
)

mykcs commented 2 months ago

A model has now appeared.

Log: a new error has appeared, and I am actually happy about it.

/Users/myk/anaconda3/envs/env-MedSAM-0422/bin/python /Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py 
wandb: Currently logged in as: mykcs (team-mykcs). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /Users/myk/PyPjcts/MedSAM_lite-0424/wandb/run-20240425_023642-mcvsmfi1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run run-6
wandb: ⭐️ View project at https://wandb.ai/team-mykcs/MedSAM-lite
wandb: 🚀 View run at https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
Finetuning with pretrained weights lite_medsam.pth
MedSAM Lite size: 9791300
if checkpoint else here
Epoch 1 at 2024-04-25 02:49:08, loss: 0.0168: 100%|██████████| 906/906 [13:03<00:00,  1.16it/s]
New best loss: 10000000000.0000 -> 0.0538
Traceback (most recent call last):
  File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 538, in <module>
    main()
  File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 531, in main
    plt.savefig(join(model_save_path, args.task_name, work_dir, "train_loss.png"))
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/pyplot.py", line 1134, in savefig
    res = fig.savefig(*args, **kwargs)  # type: ignore[func-returns-value]
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/figure.py", line 3390, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2193, in print_figure
    result = print_method(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2043, in <lambda>
    print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 497, in print_png
    self._print_pil(filename_or_obj, "png", pil_kwargs, metadata)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 446, in _print_pil
    mpl.image.imsave(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/image.py", line 1656, in imsave
    image.save(fname, **pil_kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/PIL/Image.py", line 2456, in save
    fp = builtins.open(filename, "w+b")
FileNotFoundError: [Errno 2] No such file or directory: '/Users/myk/PyPjcts/MedSAM_lite-0424/work_dir/MedSAM-lite-20240425-0236/MedSAM-lite/work_dir/train_loss.png'
wandb: 
wandb: Run history:
wandb:   ce_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:       epoch_loss ▁
wandb: epoch_loss[step] ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:           iou_gt ▁
wandb:  iou_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:             l_ce ▂▁▄▃▄▂▁▂▅█▄▂▃▂▄▂▄▁▄▆▂▂▅▃▁▂▇▂▁▂▂▂▁▁▂▂▄▂▁▆
wandb:            l_iou ▂▂▁▁▂▂▃█▁▂▃▂▁▁▂▁▁▄▁▂▁▂▂▁▁▂▁▁▂▂▅▂▂▂▃▃▂▁▁▂
wandb:            l_seg ▁▁▂▄▃▃▃▇▁▂▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▄▃▁▃
wandb:             loss ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:      loss.item() ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:        mask_loss ▁▁▃▄▃▃▃▇▂▃▄▅▁▃▃▁▃▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▅▃▁▃
wandb:  seg_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb:   ce_loss_weight 1.0
wandb:       epoch_loss 0.05377
wandb: epoch_loss[step] 0.01675
wandb:           iou_gt 0.9632
wandb:  iou_loss_weight 1.0
wandb:             l_ce 0.00258
wandb:            l_iou 0.0
wandb:            l_seg 0.01417
wandb:             loss 0.01675
wandb:      loss.item() 0.01675
wandb:        mask_loss 0.01675
wandb:  seg_loss_weight 1.0
wandb: 
wandb: 🚀 View run run-6 at: https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
wandb: ⭐️ View project at: https://wandb.ai/team-mykcs/MedSAM-lite
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240425_023642-mcvsmfi1/logs
Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
    self._loop_check_status(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 856, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 516, in _deliver_internal_messages
    return self._deliver_record(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Process finished with exit code 1

mykcs commented 2 months ago

gemini

Error analysis

The main errors you encountered are:

Detailed explanations and fixes follow:

1. FileNotFoundError: [Errno 2] No such file or directory

2. BrokenPipeError: [Errno 32] Broken pipe

Suggested changes

import os
from os.path import join

# ... your code ...

def main():
    # ... other code ...

    model_save_path = join(args.work_dir, args.task_name, work_dir)
    os.makedirs(model_save_path, exist_ok=True)  # Ensure directories exist

    plt.savefig(join(model_save_path, "train_loss.png"))

    # ... other code ...
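
The traceback shows the failure surfacing in builtins.open(filename, "w+b"), which does not create missing parent directories. A minimal sketch of that behavior, using a temporary directory and a hypothetical nested path rather than the project's real model_save_path:

```python
import os
import tempfile
from os.path import join

with tempfile.TemporaryDirectory() as root:
    # Hypothetical nested output directory that has not been created yet.
    out_dir = join(root, "work_dir", "run-name", "plots")
    target = join(out_dir, "train_loss.png")

    try:
        open(target, "w+b").close()
        failed_before = False
    except FileNotFoundError:
        failed_before = True              # parent directory does not exist

    os.makedirs(out_dir, exist_ok=True)   # create all missing parents
    open(target, "w+b").close()           # the same open call now succeeds
    saved_after = os.path.isfile(target)

print(failed_before, saved_after)
```

In the log above, the uncaught FileNotFoundError is raised right after epoch 1 is reported and the process exits with code 1, which is consistent with the run stopping after a single epoch.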

Notes

mykcs commented 2 months ago

train_mps-7.py: changed some of the model_save_path usages

mykcs commented 2 months ago

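It works! Was the earlier run really being interrupted by the wrong model_save_path? The bug can be reproduced by running mps-6.py to pin down the exact cause. For lack of time (and without digging any deeper), I am shelving this for now.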