Expected to run 10 epochs, but only 1 epoch actually runs

mykcs commented 2 months ago

train_mps-4.py: start_epoch = 0 and num_epochs = 10, which confirms the problem is not there.

mykcs commented 2 months ago

gemini

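Several things can cause this code to run only one epoch instead of the expected 10. Here are some possible causes and ways to check for them:

1. Dataset size and train_loader

Check the length of train_loader: use len(train_loader) to see how many steps train_loader iterates in one epoch. If it is far smaller than the number of steps you expect, your dataset may be small enough that it is exhausted almost immediately.
Very small or heavily imbalanced dataset: if the dataset has only a handful of samples, train_loader processes them quickly and each epoch ends early.

2. A hidden break statement

Look carefully for a break statement inside the loop: a conditional may call break and exit the for step, batch in enumerate(pbar) loop early when some condition is met.

3. Exceptions or errors

Watch for exceptions: does this code contain a try-except statement? If an exception is raised during execution and is not handled properly, it also terminates the run early.
Run under a debugger: step through the code and watch how the epoch variable changes and whether the loop exits normally.

4. Progress bar issues

Check how tqdm is used: make sure pbar is initialized with the correct total number of steps. If tqdm's total is set incorrectly, the progress bar can show completion even though only one epoch has actually run.

How to troubleshoot

Print information for every step/epoch: print the values of epoch and step both before the for step, batch in enumerate(pbar) loop starts and after the for epoch in range(start_epoch + 1, num_epochs + 1) loop ends.
Debugger: step through the code in a debugger to track how variables change and whether the loop exits somewhere unexpected.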
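
The print-based check above could look roughly like the sketch below. It is not the code from train_mps-4.py: run_epochs and flaky_epoch are hypothetical names, and the real per-epoch work is replaced by a stand-in that fails in epoch 2 so the early exit becomes visible.

```python
import traceback


def run_epochs(train_one_epoch, start_epoch=0, num_epochs=10):
    """Run the epoch loop and log entry/exit so an early stop is visible."""
    completed = []
    try:
        for epoch in range(start_epoch + 1, num_epochs + 1):
            print(f"[debug] entering epoch {epoch}/{num_epochs}")
            train_one_epoch(epoch)
            completed.append(epoch)
            print(f"[debug] finished epoch {epoch}")
    except Exception:
        # Show the full traceback so the reason for stopping is not missed.
        traceback.print_exc()
    print(f"[debug] epoch loop exited after {len(completed)} epoch(s)")
    return completed


def flaky_epoch(epoch):
    # Hypothetical stand-in for the real training step.
    if epoch == 2:
        raise FileNotFoundError("simulated failure during epoch 2")


run_epochs(flaky_epoch)
```

If the last debug line reports fewer epochs than num_epochs, the traceback printed just before it points at the statement that ended the loop.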

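If you can share more detail on the following, I can give a more precise diagnosis:

The definition of train_loader
The size of the dataset
The complete code (including the import statements)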

mykcs commented 2 months ago

It is probably related to this:

parser.add_argument(
    "--sanity_check", action="store_true",
    help="Whether to do sanity check for dataloading."
)

if do_sancheck:
    tr_dataset = NpyDataset(data_root, data_aug=True)
    tr_dataloader = DataLoader(tr_dataset, batch_size=8, shuffle=True)
    for step, batch in enumerate(tr_dataloader):
        # show the example
        _, axs = plt.subplots(1, 2, figsize=(10, 10))
        idx = random.randint(0, 4)

        image = batch["image"]
        gt = batch["gt2D"]
        bboxes = batch["bboxes"]
        names_temp = batch["image_name"]

        axs[0].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[0])
        show_box(bboxes[idx].numpy().squeeze(), axs[0])
        axs[0].axis('off')
        # set title
        axs[0].set_title(names_temp[idx])
        idx = random.randint(4, 7)
        axs[1].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[1])
        show_box(bboxes[idx].numpy().squeeze(), axs[1])
        axs[1].axis('off')
        # set title
        axs[1].set_title(names_temp[idx])
        plt.subplots_adjust(wspace=0.01, hspace=0)
        plt.savefig(
            join(work_dir, 'medsam_lite-train_bbox_prompt_sanitycheck_DA.png'),
            bbox_inches='tight',
            dpi=300
        )
        plt.close()
        break

mykcs commented 2 months ago

train_mps-5.py: new change

parser.add_argument(
    "--sanity_check", action="store_true", default=False,
    help="Whether to do sanity check for dataloading."
)
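
For what it is worth, argparse's action="store_true" already defaults to False, so adding default=False should not change behaviour. The sketch below checks this with a throwaway parser; build_parser is a hypothetical helper, not part of the training script.

```python
import argparse


def build_parser(explicit_default):
    """Build a parser with or without an explicit default=False."""
    parser = argparse.ArgumentParser()
    extra = {"default": False} if explicit_default else {}
    parser.add_argument(
        "--sanity_check", action="store_true",
        help="Whether to do sanity check for dataloading.",
        **extra,
    )
    return parser


for explicit in (False, True):
    p = build_parser(explicit)
    without_flag = p.parse_args([]).sanity_check
    with_flag = p.parse_args(["--sanity_check"]).sanity_check
    print(f"explicit default={explicit}: no flag -> {without_flag}, flag -> {with_flag}")
```

Both variants give the same values, which fits the next comment reporting that the run still stops after one epoch.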

mykcs commented 2 months ago

Still only 1 epoch. Log: https://wandb.ai/team-mykcs/MedSAM-lite/runs/388c0cs6/logs

The error "UnboundLocalError: local variable 'best_loss' referenced before assignment" means that you're trying to use the variable best_loss before it has been given a value within the scope where you're using it. Here's how to fix this and understand why it's happening:

Understanding the Error

Local Scope: A name that is assigned anywhere inside a function is treated as local to that whole function, unless it is declared global or nonlocal. It has no value until the assignment actually runs.

Conditional Assignment: Currently, best_loss is only assigned a value inside this conditional block:

if epoch_loss_reduced < best_loss:
    print(f"New best loss: {best_loss:.4f} -> {epoch_loss_reduced:.4f}")
    best_loss = epoch_loss_reduced
    # ...

Problem: The code attempts to save the best_loss in the checkpoint right after this block:
```
checkpoint = {
    # ...
    "best_loss": best_loss,
}
```
Because best_loss is assigned inside the function, Python treats it as a local variable there. If nothing assigns it before it is first read (the comparison epoch_loss_reduced < best_loss already reads it, and so does the checkpoint dict), Python raises UnboundLocalError.

Solution

Initialize best_loss: Inside the same function as the training loop, before the loop starts (likely near start_epoch = 0), initialize best_loss with a very large number so that the first comparison is true:
```
best_loss = float('inf')  # Or a large starting value
```
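
A minimal sketch of the failure and the fix. It assumes best_loss was originally defined at module level while the comparison and assignment sat inside main(); that is a guess about the script, and main_broken / main_fixed are hypothetical names.

```python
best_loss = 1e10  # module-level value, defined outside any function


def main_broken(epoch_losses):
    for loss in epoch_losses:
        # The assignment below makes best_loss local to this function,
        # so this read fails before the module-level value is ever seen.
        if loss < best_loss:
            best_loss = loss
    return best_loss


def main_fixed(epoch_losses):
    best_loss = float("inf")  # initialized in the same scope that uses it
    for loss in epoch_losses:
        if loss < best_loss:
            best_loss = loss
    return best_loss


try:
    main_broken([0.0538])
except UnboundLocalError as err:
    print("broken version raised:", err)

print("fixed version returns:", main_fixed([0.0538, 0.0412]))
```

Declaring global best_loss inside the function would also work, but initializing it locally keeps the training state in one place.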

Additional Considerations

Error Location: The error indicates that the issue is likely occurring within the main() function. Check the exact code section surrounding lines 400 and 535 to verify where the best_loss variable is being used and assigned.

Let me know if you'd like help pinpointing the exact location in your code to make the fix!

mykcs commented 2 months ago

train_mps-6.py

Moved more of the code that was outside any def into def main.

Added print(f"if checkpoint else here").

parser.add_argument(
    # "-work_dir", type=str, default="./workdir",
    "-work_dir", type=str, default="./work_dir",
    help="Path to the working directory where checkpoints and logs will be saved."
)

mykcs commented 2 months ago

The model showed up.

The log shows a new error, and I'm actually happy about it.

/Users/myk/anaconda3/envs/env-MedSAM-0422/bin/python /Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py 
wandb: Currently logged in as: mykcs (team-mykcs). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /Users/myk/PyPjcts/MedSAM_lite-0424/wandb/run-20240425_023642-mcvsmfi1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run run-6
wandb: ⭐️ View project at https://wandb.ai/team-mykcs/MedSAM-lite
wandb: 🚀 View run at https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
Finetuning with pretrained weights lite_medsam.pth
MedSAM Lite size: 9791300
if checkpoint else here
Epoch 1 at 2024-04-25 02:49:08, loss: 0.0168: 100%|██████████| 906/906 [13:03<00:00,  1.16it/s]
New best loss: 10000000000.0000 -> 0.0538
Traceback (most recent call last):
  File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 538, in <module>
    main()
  File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 531, in main
    plt.savefig(join(model_save_path, args.task_name, work_dir, "train_loss.png"))
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/pyplot.py", line 1134, in savefig
    res = fig.savefig(*args, **kwargs)  # type: ignore[func-returns-value]
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/figure.py", line 3390, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2193, in print_figure
    result = print_method(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2043, in <lambda>
    print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 497, in print_png
    self._print_pil(filename_or_obj, "png", pil_kwargs, metadata)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 446, in _print_pil
    mpl.image.imsave(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/image.py", line 1656, in imsave
    image.save(fname, **pil_kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/PIL/Image.py", line 2456, in save
    fp = builtins.open(filename, "w+b")
FileNotFoundError: [Errno 2] No such file or directory: '/Users/myk/PyPjcts/MedSAM_lite-0424/work_dir/MedSAM-lite-20240425-0236/MedSAM-lite/work_dir/train_loss.png'
wandb: 
wandb: Run history:
wandb:   ce_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:       epoch_loss ▁
wandb: epoch_loss[step] ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:           iou_gt ▁
wandb:  iou_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:             l_ce ▂▁▄▃▄▂▁▂▅█▄▂▃▂▄▂▄▁▄▆▂▂▅▃▁▂▇▂▁▂▂▂▁▁▂▂▄▂▁▆
wandb:            l_iou ▂▂▁▁▂▂▃█▁▂▃▂▁▁▂▁▁▄▁▂▁▂▂▁▁▂▁▁▂▂▅▂▂▂▃▃▂▁▁▂
wandb:            l_seg ▁▁▂▄▃▃▃▇▁▂▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▄▃▁▃
wandb:             loss ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:      loss.item() ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb:        mask_loss ▁▁▃▄▃▃▃▇▂▃▄▅▁▃▃▁▃▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▅▃▁▃
wandb:  seg_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: 
wandb: Run summary:
wandb:   ce_loss_weight 1.0
wandb:       epoch_loss 0.05377
wandb: epoch_loss[step] 0.01675
wandb:           iou_gt 0.9632
wandb:  iou_loss_weight 1.0
wandb:             l_ce 0.00258
wandb:            l_iou 0.0
wandb:            l_seg 0.01417
wandb:             loss 0.01675
wandb:      loss.item() 0.01675
wandb:        mask_loss 0.01675
wandb:  seg_loss_weight 1.0
wandb: 
wandb: 🚀 View run run-6 at: https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
wandb: ⭐️ View project at: https://wandb.ai/team-mykcs/MedSAM-lite
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240425_023642-mcvsmfi1/logs
Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
    self._loop_check_status(
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 856, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 516, in _deliver_internal_messages
    return self._deliver_record(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Process finished with exit code 1

mykcs commented 2 months ago

gemini

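Error analysis

You have run into two main errors:

FileNotFoundError: the program tried to save the training-loss plot (train_loss.png) to a directory that does not exist.
BrokenPipeError: the connection dropped while Weights & Biases (WandB) was syncing data.

Detailed explanations and fixes follow: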

1. FileNotFoundError: [Errno 2] No such file or directory

Problem: FileNotFoundError indicates that the file path the code tried to write to is invalid:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/myk/PyPjcts/MedSAM_lite-0424/work_dir/MedSAM-lite-20240425-0236/MedSAM-lite/work_dir/train_loss.png'

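Fix:
1. Check the code: make sure the target directory exists before saving train_loss.png. You may need os.makedirs(..., exist_ok=True) to create the required directory structure.
2. Check permissions: make sure your program has write permission for the directory where the image is saved.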

2. BrokenPipeError: [Errno 32] Broken pipe

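Problem: the Weights & Biases (WandB) sync process was interrupted abnormally. This may be caused by a network problem or a temporary failure of the connection to the WandB servers.
Fix:
1. Network connection: check that your internet connection is stable.
2. Retry: these problems often resolve themselves; wait a while and then rerun your code.
3. WandB support: if the problem persists, consult WandB's support documentation or contact their technical support.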

Suggested change

import os
from os.path import join

import matplotlib.pyplot as plt

# ... your code ...

def main():
    # ... other code ...

    # Build the output directory once and create it before saving
    model_save_path = join(args.work_dir, args.task_name, work_dir)
    os.makedirs(model_save_path, exist_ok=True)  # Ensure directories exist

    plt.savefig(join(model_save_path, "train_loss.png"))

    # ... other code ...

Explanation

Added the line os.makedirs(model_save_path, exist_ok=True) to create the path before saving the image, which avoids the FileNotFoundError.

mykcs commented 2 months ago

train_mps-7.py: changed some uses of model_save_path.

mykcs commented 2 months ago

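It works! Was the run really being interrupted earlier by the wrong model_save_path? Running mps-6.py should reproduce the bug and pin down the cause. For lack of time (and without digging deeper), I'm shelving this for now.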
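
A small sketch of what the mps-6 traceback suggests, using only the standard library. A plain file write stands in for plt.savefig, since both fail the same way when the parent directory is missing. The directory names mirror the path in the error message; the assumption that only model_save_path existed on disk is mine, and save_text is a hypothetical helper.

```python
import os
import tempfile
from os.path import join


def save_text(path, text):
    """Stand-in for plt.savefig: open() fails if the directory is missing."""
    with open(path, "w") as f:
        f.write(text)


root = tempfile.mkdtemp()
model_save_path = join(root, "work_dir", "MedSAM-lite-20240425-0236")
os.makedirs(model_save_path)  # assumed: only this level exists

# Same shape as the failing call: extra components appended at save time.
target = join(model_save_path, "MedSAM-lite", "work_dir", "train_loss.png")

try:
    save_text(target, "loss curve")
    failed = False
except FileNotFoundError as err:
    failed = True
    print("save failed:", err.filename)

# Creating the parent directory first lets the same call succeed.
os.makedirs(os.path.dirname(target), exist_ok=True)
save_text(target, "loss curve")
print("saved:", os.path.exists(target))
```

In the mps-6 log the exception is raised from main() right after epoch 1 finishes, which would explain why no second epoch ever started.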

mykcs / MedSAM_lite-0424

Expected to run 10 epochs, but only 1 epoch actually runs #3
