Open mykcs opened 2 months ago
gemini
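There are several situations that could cause this code to run only one epoch instead of the expected 10. Below are some possible causes and ways to check them:
1. Dataset size and train_loader
Length of train_loader: use len(train_loader) to find out how many steps train_loader iterates in one epoch. If it is far smaller than the number of steps you expect, your dataset may be small enough to be exhausted in very few iterations. train_loader may then process the samples very quickly, so that each epoch ends early.
2. Hidden break statements
Check whether a break exits the for step, batch in enumerate(pbar) loop early.
3. Exceptions or errors
Does your code have a try-except statement? If an exception is raised during execution and is not handled properly, training will also terminate early. Watch how the epoch variable changes and whether the loop exits normally.
4. Progress bar issues
Use of tqdm: make sure pbar is initialized with the correct total number of steps. If the total length of tqdm is set incorrectly, the progress bar may show as complete even though only one epoch has actually run.
How to troubleshoot
Before the for step, batch in enumerate(pbar) loop starts, and after the for epoch in range(start_epoch + 1, num_epochs + 1) loop ends, print the values of epoch and step.
If you can provide more details about the following questions, I can give you a more precise diagnosis.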
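The checks above can be sketched as an instrumented loop. This is a minimal illustration, not the training script: a plain list stands in for train_loader, and run_epochs is a hypothetical helper name.

```python
def run_epochs(train_loader, start_epoch=0, num_epochs=10):
    """Run an instrumented epoch loop and return (epoch, steps) per epoch."""
    print(f"steps per epoch: {len(train_loader)}")
    history = []
    epoch = start_epoch
    for epoch in range(start_epoch + 1, num_epochs + 1):
        print(f"entering epoch {epoch}")
        steps = 0
        for step, batch in enumerate(train_loader):
            steps += 1  # a real loop would run forward/backward here
        history.append((epoch, steps))
    print(f"loop exited normally, last epoch: {epoch}")
    return history


# Stand-in for a DataLoader: 3 "batches" of dummy data (assumption).
fake_loader = [[0, 1], [2, 3], [4, 5]]
history = run_epochs(fake_loader, start_epoch=0, num_epochs=10)
```

If the final "loop exited normally" line never prints in the real script, the loop was cut short by an exception or an early exit rather than by the loop bounds.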
The definition of train_loader should be related to these:
parser.add_argument(
    "--sanity_check", action="store_true",
    help="Whether to do sanity check for dataloading."
)

if do_sancheck:
    tr_dataset = NpyDataset(data_root, data_aug=True)
    tr_dataloader = DataLoader(tr_dataset, batch_size=8, shuffle=True)
    for step, batch in enumerate(tr_dataloader):
        # show the example
        _, axs = plt.subplots(1, 2, figsize=(10, 10))
        idx = random.randint(0, 4)
        image = batch["image"]
        gt = batch["gt2D"]
        bboxes = batch["bboxes"]
        names_temp = batch["image_name"]
        axs[0].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[0])
        show_box(bboxes[idx].numpy().squeeze(), axs[0])
        axs[0].axis('off')
        # set title
        axs[0].set_title(names_temp[idx])
        idx = random.randint(4, 7)
        axs[1].imshow(image[idx].cpu().permute(1, 2, 0).numpy())
        show_mask(gt[idx].cpu().squeeze().numpy(), axs[1])
        show_box(bboxes[idx].numpy().squeeze(), axs[1])
        axs[1].axis('off')
        # set title
        axs[1].set_title(names_temp[idx])
        plt.subplots_adjust(wspace=0.01, hspace=0)
        plt.savefig(
            join(work_dir, 'medsam_lite-train_bbox_prompt_sanitycheck_DA.png'),
            bbox_inches='tight',
            dpi=300
        )
        plt.close()
        break
train_mps-5.py
New change:
parser.add_argument(
    "--sanity_check", action="store_true", default=False,
    help="Whether to do sanity check for dataloading."
)
Still only 1 epoch. Log: https://wandb.ai/team-mykcs/MedSAM-lite/runs/388c0cs6/logs
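That result is consistent with how argparse treats action="store_true": the flag already defaults to False when it is omitted, so adding default=False does not change behavior. A small self-contained check follows; build_parser is a hypothetical helper and the parser is a stand-in, not the one in the training script.

```python
import argparse


def build_parser(explicit_default):
    """Build a parser with or without an explicit default=False."""
    parser = argparse.ArgumentParser()
    kwargs = {"default": False} if explicit_default else {}
    parser.add_argument(
        "--sanity_check", action="store_true",
        help="Whether to do sanity check for dataloading.",
        **kwargs,
    )
    return parser


implicit = build_parser(explicit_default=False)
explicit = build_parser(explicit_default=True)

# Parse once without the flag and once with it, for both variants.
without_flag = (implicit.parse_args([]).sanity_check,
                explicit.parse_args([]).sanity_check)
with_flag = (implicit.parse_args(["--sanity_check"]).sanity_check,
             explicit.parse_args(["--sanity_check"]).sanity_check)
print(without_flag, with_flag)
```

Both variants parse identically, so this change alone would not be expected to affect the number of epochs.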
The error "UnboundLocalError: local variable 'best_loss' referenced before assignment" means that you're trying to use the variable best_loss before it has been given a value within the scope where you're using it. Here's how to fix this and understand why it's happening:
Understanding the Error
Local Scope: A name that is assigned anywhere inside a Python function is local to that function. It only has a value there once one of those assignments has actually run.
Conditional Assignment: Currently, best_loss is only assigned a value inside this conditional block:
if epoch_loss_reduced < best_loss:
    print(f"New best loss: {best_loss:.4f} -> {epoch_loss_reduced:.4f}")
    best_loss = epoch_loss_reduced
    # ...
Problem: The condition itself reads best_loss, and the code also saves best_loss in the checkpoint right after this block:
checkpoint = {
    # ...
    "best_loss": best_loss,
}
If no assignment to best_loss has run before the first epoch reaches these lines, the variable has no value yet, which leads to the error when you try to use it.
Solution
Initialize best_loss: Before the start of the training loop (likely before start_epoch = 0), initialize the best_loss variable with a very large number to ensure the initial comparison will be true:
best_loss = float('inf')  # Or a large starting value
Additional Considerations
main() function: Check the exact code section surrounding lines 400 and 535 to verify where the best_loss variable is being used and assigned. Let me know if you'd like help pinpointing the exact location in your code to make the fix!
train_mps-6.py
Moved more of the top-level (non-def) code into def main.
Added print(f"if checkpoint else here").
parser.add_argument(
    # "-work_dir", type=str, default="./workdir",
    "-work_dir", type=str, default="./work_dir",
    help="Path to the working directory where checkpoints and logs will be saved."
)
A model appeared.
Log (a new error showed up, and I'm actually happy about it):
/Users/myk/anaconda3/envs/env-MedSAM-0422/bin/python /Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py
wandb: Currently logged in as: mykcs (team-mykcs). Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.16.6
wandb: Run data is saved locally in /Users/myk/PyPjcts/MedSAM_lite-0424/wandb/run-20240425_023642-mcvsmfi1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run run-6
wandb: ⭐️ View project at https://wandb.ai/team-mykcs/MedSAM-lite
wandb: 🚀 View run at https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
Finetuning with pretrained weights lite_medsam.pth
MedSAM Lite size: 9791300
if checkpoint else here
Epoch 1 at 2024-04-25 02:49:08, loss: 0.0168: 100%|██████████| 906/906 [13:03<00:00, 1.16it/s]
New best loss: 10000000000.0000 -> 0.0538
Traceback (most recent call last):
File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 538, in <module>
main()
File "/Users/myk/PyPjcts/MedSAM_lite-0424/train_mps-6.py", line 531, in main
plt.savefig(join(model_save_path, args.task_name, work_dir, "train_loss.png"))
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/pyplot.py", line 1134, in savefig
res = fig.savefig(*args, **kwargs) # type: ignore[func-returns-value]
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/figure.py", line 3390, in savefig
self.canvas.print_figure(fname, **kwargs)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2193, in print_figure
result = print_method(
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backend_bases.py", line 2043, in <lambda>
print_method = functools.wraps(meth)(lambda *args, **kwargs: meth(
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 497, in print_png
self._print_pil(filename_or_obj, "png", pil_kwargs, metadata)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/backends/backend_agg.py", line 446, in _print_pil
mpl.image.imsave(
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/matplotlib/image.py", line 1656, in imsave
image.save(fname, **pil_kwargs)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/PIL/Image.py", line 2456, in save
fp = builtins.open(filename, "w+b")
FileNotFoundError: [Errno 2] No such file or directory: '/Users/myk/PyPjcts/MedSAM_lite-0424/work_dir/MedSAM-lite-20240425-0236/MedSAM-lite/work_dir/train_loss.png'
wandb:
wandb: Run history:
wandb: ce_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: epoch_loss ▁
wandb: epoch_loss[step] ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb: iou_gt ▁
wandb: iou_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: l_ce ▂▁▄▃▄▂▁▂▅█▄▂▃▂▄▂▄▁▄▆▂▂▅▃▁▂▇▂▁▂▂▂▁▁▂▂▄▂▁▆
wandb: l_iou ▂▂▁▁▂▂▃█▁▂▃▂▁▁▂▁▁▄▁▂▁▂▂▁▁▂▁▁▂▂▅▂▂▂▃▃▂▁▁▂
wandb: l_seg ▁▁▂▄▃▃▃▇▁▂▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▄▃▁▃
wandb: loss ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb: loss.item() ▁▁▃▄▃▃▃█▂▃▄▅▁▃▃▁▂▄▂▂▁▃▃▅▁▆▂▁▃▄▃▃▂▃▅█▄▃▁▃
wandb: mask_loss ▁▁▃▄▃▃▃▇▂▃▄▅▁▃▃▁▃▄▂▂▁▃▃▅▁▆▂▁▃▄▂▂▂▃▅█▅▃▁▃
wandb: seg_loss_weight ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb: ce_loss_weight 1.0
wandb: epoch_loss 0.05377
wandb: epoch_loss[step] 0.01675
wandb: iou_gt 0.9632
wandb: iou_loss_weight 1.0
wandb: l_ce 0.00258
wandb: l_iou 0.0
wandb: l_seg 0.01417
wandb: loss 0.01675
wandb: loss.item() 0.01675
wandb: mask_loss 0.01675
wandb: seg_loss_weight 1.0
wandb:
wandb: 🚀 View run run-6 at: https://wandb.ai/team-mykcs/MedSAM-lite/runs/mcvsmfi1
wandb: ⭐️ View project at: https://wandb.ai/team-mykcs/MedSAM-lite
wandb: Synced 6 W&B file(s), 0 media file(s), 2 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20240425_023642-mcvsmfi1/logs
Exception in thread IntMsgThr:
Traceback (most recent call last):
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
self._loop_check_status(
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface.py", line 856, in deliver_internal_messages
return self._deliver_internal_messages(internal_message)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 516, in _deliver_internal_messages
return self._deliver_record(record)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/Users/myk/anaconda3/envs/env-MedSAM-0422/lib/python3.10/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Process finished with exit code 1
gemini
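The main error you encountered: the code tries to save a file (train_loss.png) to a directory that does not exist. Below is a detailed explanation and how to fix it:
1. FileNotFoundError: [Errno 2] No such file or directory
Problem: FileNotFoundError indicates that the file path the code is trying to write to is incorrect:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/myk/PyPjcts/MedSAM_lite-0424/work_dir/MedSAM-lite-20240425-0236/MedSAM-lite/work_dir/train_loss.png'
Solution:
Make sure the specified path exists before saving train_loss.png. You may need to use os.makedirs(..., exist_ok=True) to create the necessary directory structure.
2. BrokenPipeError: [Errno 32] Broken pipe
Problem: The Weights & Biases (WandB) sync process was interrupted abnormally. This may be caused by a network problem or a temporary failure of the connection to the WandB server.
Solution:
Suggested change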
import os
# ... your code ...
def main():
    # ... other code ...
    model_save_path = join(args.work_dir, args.task_name, work_dir)
    os.makedirs(model_save_path, exist_ok=True)  # Ensure directories exist
    plt.savefig(join(model_save_path, "train_loss.png"))
    # ... other code ...
Explanation
Adding the line os.makedirs(model_save_path, exist_ok=True) creates the path before the image is saved, which avoids the FileNotFoundError.
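To see why the path in the traceback did not exist and what os.makedirs changes, here is a small sketch run inside a temporary directory. The values of model_save_path, task_name, and work_dir are assumptions inferred from the path in the traceback, not read from the script, and a plain file write stands in for plt.savefig.

```python
import os
import tempfile
from os.path import join

with tempfile.TemporaryDirectory() as root:
    # Assumed values, inferred from the traceback path.
    model_save_path = join(root, "work_dir", "MedSAM-lite-20240425-0236")
    task_name = "MedSAM-lite"
    work_dir = "work_dir"
    os.makedirs(model_save_path)  # assume the run directory itself exists

    # Joining task_name and work_dir again points into subdirectories
    # that nothing has created yet.
    target = join(model_save_path, task_name, work_dir, "train_loss.png")
    try:
        open(target, "wb").close()
        first_attempt = "ok"
    except FileNotFoundError:
        first_attempt = "FileNotFoundError"

    # Creating the parent directory first makes the same write succeed.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    open(target, "wb").close()
    second_attempt_exists = os.path.exists(target)

print(first_attempt, second_attempt_exists)
```

Under these assumptions there are two ways out: create the nested directories before saving, or drop the repeated path components so the file lands directly in model_save_path.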
train_mps-7.py
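Changed some of the model_save_path usages.
It worked!
Was the earlier run really interrupted because model_save_path was wrong?
The bug could be reproduced by running mps-6.py to pin down the cause.
Due to time constraints (and not digging any deeper), this is shelved for now.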
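One way to probe that question without rerunning the full training is a toy loop in which a save step fails at the end of the first epoch. This is only a sketch under the assumption that the failing plt.savefig call sits inside the epoch loop, as the traceback from train_mps-6.py (raised right after epoch 1) suggests; save_plot and train are hypothetical stand-ins.

```python
def save_plot(path_ok):
    """Stand-in for plt.savefig: fails when the target directory is missing."""
    if not path_ok:
        raise FileNotFoundError("No such file or directory")


def train(num_epochs, path_ok, completed):
    """Append each finished epoch to `completed`, then try to save a plot."""
    for epoch in range(1, num_epochs + 1):
        completed.append(epoch)  # the epoch's training steps finish
        save_plot(path_ok)       # an uncaught error here ends the loop


broken_run = []
try:
    train(10, path_ok=False, completed=broken_run)
except FileNotFoundError:
    pass  # the exception propagated out of train() after epoch 1

fixed_run = []
train(10, path_ok=True, completed=fixed_run)
print(len(broken_run), len(fixed_run))
```

In this toy setup the broken path yields exactly one completed epoch, which matches the symptom; whether the same mechanism explains the earlier runs would still need the mps-6.py reproduction.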
train_mps-4.py: start_epoch = 0, num_epochs = 10. Verified that the problem is not here.
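For reference, the arithmetic behind that check, assuming the epoch loop is for epoch in range(start_epoch + 1, num_epochs + 1) as quoted earlier in the thread. The helper name epochs_to_run is hypothetical, and start_epoch = 9 is a made-up contrast case, not something observed in these runs.

```python
def epochs_to_run(start_epoch, num_epochs):
    """Return the epoch numbers the loop bounds would visit."""
    return list(range(start_epoch + 1, num_epochs + 1))


fresh = epochs_to_run(start_epoch=0, num_epochs=10)
resumed = epochs_to_run(start_epoch=9, num_epochs=10)  # hypothetical resume
print(len(fresh), resumed)
```

With start_epoch = 0 the bounds allow all 10 epochs, so a run that stops after one epoch has to be explained by something other than the loop range.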