ultralytics / hub

Ultralytics HUB tutorials and support
https://hub.ultralytics.com
GNU Affero General Public License v3.0

Model upload issue: stuck at 100% Optimizing weights #796

Open lsun21 opened 2 months ago

lsun21 commented 2 months ago

HUB Component

Models, Training

Bug

I finished training the model on Google Colab, but the upload failed at the end. [Screenshot 2024-08-07 at 7 38 58 PM] [Screenshot 2024-08-07 at 7 39 09 PM]

Afterward, I tried to resume the model and run an extra epoch, but the same failure happened again. How can I recover the model properly?

[Screenshot 2024-08-07 at 7 50 16 PM]

Thanks for your input!


github-actions[bot] commented 2 months ago

👋 Hello @lsun21, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

pderrenger commented 2 months ago

@lsun21 hello,

Thank you for reaching out and providing detailed information about the issue you're encountering. It sounds like you're experiencing a problem with the model upload process getting stuck at 100% during the optimization of weights.

To help us better understand and resolve this issue, could you please try the following steps:

  1. Update to the Latest Version: Ensure that you are using the latest versions of the Ultralytics packages. You can update them using the following commands:

    pip install --upgrade ultralytics
  2. Check Internet Connection: Sometimes, network issues can cause uploads to hang. Please verify that your internet connection is stable.

  3. Retry the Upload: Occasionally, retrying the upload process can resolve temporary issues. You can do this by running the following command in your Colab notebook:

    from ultralytics import YOLO
    
    model = YOLO('path/to/your/model.pt')
    model.upload()
  4. Log Files: If the issue persists, please check the log files for any error messages or warnings that might provide more insight into what is going wrong. You can find the logs in the runs directory of your project.

  5. Alternative Upload Method: If the direct upload continues to fail, you can manually upload the model to the Ultralytics HUB by downloading the .pt file from Colab and then uploading it through the HUB interface.
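
For the manual route in step 5, the first task is finding the weight files inside the Colab session. The following is a minimal sketch, not an official Ultralytics utility: `find_checkpoints` and `download_from_colab` are hypothetical helper names, and the `runs` directory is assumed to be the default output location mentioned in step 4.

```python
from pathlib import Path


def find_checkpoints(runs_dir: str = "runs") -> list[Path]:
    """Return every .pt file under runs_dir, newest first (by modification time)."""
    root = Path(runs_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.pt"), key=lambda p: p.stat().st_mtime, reverse=True)


def download_from_colab(path: Path) -> None:
    """Trigger a browser download when running inside Google Colab."""
    try:
        from google.colab import files  # only importable inside a Colab runtime
    except ImportError:
        print(f"Not running in Colab; copy {path} manually.")
        return
    files.download(str(path))


# List what is available before deciding which file to download.
for ckpt in find_checkpoints("runs"):
    print(ckpt)
```

Calling `download_from_colab(find_checkpoints("runs")[0])` would then fetch the most recently written file, which may or may not be the one you want to upload.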

If you have tried all the above steps and the issue still persists, please let us know. Providing any additional error messages or logs would be very helpful for further troubleshooting.

Thank you for your patience and cooperation. We look forward to helping you resolve this issue!

lsun21 commented 2 months ago

Hi! Thanks for your promptness!

I followed your suggestions: I upgraded ultralytics first, then tried to upload the model with the code above. There is an error on the line with model.upload(). Do you know how to fix it? [Screenshot: upload]

Thank you!!!

sergiuwaxmann commented 2 months ago

@lsun21 It looks like the final weights weren't uploaded... I suggest resuming training from the last checkpoint.

lsun21 commented 2 months ago

Thanks for your reply!

Which checkpoint do you suggest resuming from? It shows that all the checkpoints (100/100) have been saved. Since I already ran an extra one (not sure if it was saved), it now shows -1 epochs remaining.

Screenshot 2024-08-08 at 10 28 20 AM

Thanks for all of your help!

pderrenger commented 2 months ago

Hello @lsun21,

Thank you for your patience and for providing additional details. Given the situation, it seems like you have multiple checkpoints saved. To resume training from the last successful checkpoint, you can use the most recent one before the issue occurred.

Here's how you can do it:

  1. Identify the Last Successful Checkpoint: Check the runs/train/exp/weights directory (or the equivalent directory where your training results are saved) for the latest checkpoint file. These files are typically named last.pt, best.pt, or epoch_xx.pt.

  2. Resume Training: Use the identified checkpoint to resume training. Here's an example code snippet to help you resume training from a specific checkpoint:

    from ultralytics import YOLO
    
    # Load the model from the last successful checkpoint
    model = YOLO('path/to/your/checkpoint.pt')
    
    # Resume training
    model.train(data='path/to/your/data.yaml', epochs=additional_epochs)
  3. Upload the Model: After resuming and completing the additional epochs, try uploading the model again:

    model.upload()
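
Before resuming, it can help to look at what a checkpoint file records about its own progress. The sketch below is an assumption-laden illustration: the key names `epoch`, `optimizer`, and `train_args` are assumed to be present in the checkpoint dictionary, and `summarize_checkpoint` and `load_checkpoint` are hypothetical helpers, not part of the Ultralytics API.

```python
from typing import Any, Mapping


def summarize_checkpoint(ckpt: Mapping[str, Any]) -> dict:
    """Pull out the fields that matter when deciding how to resume.

    The key names used here are assumptions about the checkpoint layout.
    """
    train_args = ckpt.get("train_args") or {}
    return {
        "epoch": ckpt.get("epoch"),
        "configured_epochs": train_args.get("epochs"),
        "has_optimizer_state": ckpt.get("optimizer") is not None,
    }


def load_checkpoint(path: str) -> Mapping[str, Any]:
    """Load a .pt file as a dict (needs torch and ultralytics installed)."""
    import torch  # imported lazily so summarize_checkpoint works without torch

    # weights_only=False unpickles arbitrary objects: only use on files you trust.
    return torch.load(path, map_location="cpu", weights_only=False)


# Illustrative values only, not taken from a real checkpoint.
example = {"epoch": 100, "optimizer": {"state": {}}, "train_args": {"epochs": 100}}
print(summarize_checkpoint(example))
```

If the recorded epoch already equals the configured total, a plain resume has nothing left to run, which is worth knowing before starting another session.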

If you encounter any issues during this process, please provide any error messages or logs that appear. This will help us diagnose the problem more effectively.

Thank you for your cooperation, and I hope this helps resolve the issue! If you have any further questions, feel free to ask. 😊

lsun21 commented 2 months ago

Thanks for your response.

I am now stuck resuming the model. It shows that the model has been trained for 100 epochs, so I assumed this is the checkpoint I should restart from, and I set epochs = 1 just to save time.

But somehow, it starts training another 100 epochs by default. I changed the argument, but it still did not work. How should I fix it?

[Screenshots: resume_100, 102 epoch]

Many thanks!

sergiuwaxmann commented 2 months ago

@lsun21 If you just use the command shown in the Ultralytics HUB UI to resume training (no extra arguments), does it work?

lsun21 commented 2 months ago

No, it still automatically starts training with another 100 epochs...

pderrenger commented 2 months ago

Hello @lsun21,

Thank you for your patience and for providing additional details. It seems like the training process is not respecting the specified number of epochs when resuming from a checkpoint. Let's try a more explicit approach to ensure the correct number of epochs is set.

Here's how you can explicitly set the number of epochs when resuming training:

  1. Load the Model and Set the Number of Epochs:

    from ultralytics import YOLO
    
    # Load the model from the last successful checkpoint
    model = YOLO('path/to/your/checkpoint.pt')
    
    # Resume training with the specified number of epochs
    model.train(data='path/to/your/data.yaml', epochs=1, resume=True)
  2. Verify the Training Configuration: Ensure that the training configuration is correctly set to resume from the checkpoint and only run for the specified number of epochs.
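
One way to carry out the verification in step 2 is to read the `engine/trainer:` line that the trainer prints at startup and check which values were actually applied. This is a rough sketch: `parse_trainer_args` is a hypothetical helper, and the sample line is abridged from the log posted later in this thread.

```python
import re


def parse_trainer_args(line: str) -> dict[str, str]:
    """Parse an 'engine/trainer: key=value, key=value, ...' log line into a dict of strings."""
    _, _, body = line.partition("engine/trainer:")
    # Split only on commas that are followed by "name=", so values that
    # themselves contain commas (for example lists) stay in one piece.
    parts = re.split(r",\s+(?=[A-Za-z_]\w*=)", body.strip())
    args = {}
    for part in parts:
        key, sep, value = part.partition("=")
        if sep:
            args[key.strip()] = value.strip()
    return args


log_line = (
    "engine/trainer: task=detect, mode=train, model=weights/epoch-100.pt, "
    "epochs=100, device=[0], resume=weights/epoch-100.pt"
)
args = parse_trainer_args(log_line)
print(args["epochs"], args["resume"])
```

If the printed `epochs` value differs from what was passed to `model.train()`, the local argument was not the one in effect.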

If the issue persists, please make sure you are using the latest version of the Ultralytics package. You can update it using:

pip install --upgrade ultralytics

If you continue to experience difficulties, please provide any additional error messages or logs that appear. This will help us diagnose the problem more effectively.

Thank you for your cooperation, and I hope this helps resolve the issue! If you have any further questions, feel free to ask. 😊

lsun21 commented 2 months ago

Thanks for your continued input.

I tried specifying epochs=1 (and also running without arguments, as @sergiuwaxmann suggested), but the model always shows "Starting training for 200 epochs", which I never set.

Here is the full log:

requirements: Ultralytics requirement ['hub-sdk>=0.0.8'] not found, attempting AutoUpdate...
Collecting hub-sdk>=0.0.8
  Downloading hub_sdk-0.0.8-py3-none-any.whl.metadata (10 kB)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from hub-sdk>=0.0.8) (2.32.3)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.8) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.8) (3.7)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.8) (2.0.7)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->hub-sdk>=0.0.8) (2024.7.4)
Downloading hub_sdk-0.0.8-py3-none-any.whl (40 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.9/40.9 kB 4.9 MB/s eta 0:00:00
Installing collected packages: hub-sdk
Successfully installed hub-sdk-0.0.8

requirements: AutoUpdate success ✅ 3.3s, installed 1 package: ['hub-sdk>=0.0.8']
requirements: ⚠️ Restart runtime or rerun command for updates to take effect

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/cUDUDcKp7iarW2k2VNn8 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/uwwrWu3vbnOw9IfulmjBYyFmrUV2/models/cUDUDcKp7iarW2k2VNn8/epoch-100.pt to 'weights/epoch-100.pt'...
2024-08-10 19:08:00,453 - hub_sdk.helpers.logger - ERROR - Unknown error occurred.
ERROR:hub_sdk.helpers.logger:Unknown error occurred.
2024-08-10 19:08:00,457 - hub_sdk.helpers.logger - ERROR - Failed to start heartbeats: 'NoneType' object has no attribute 'json'
ERROR:hub_sdk.helpers.logger:Failed to start heartbeats: 'NoneType' object has no attribute 'json'
Exception in thread Thread-10 (_start_heartbeats):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/hub_sdk/base/server_clients.py", line 151, in _start_heartbeats
    raise e
  File "/usr/local/lib/python3.10/dist-packages/hub_sdk/base/server_clients.py", line 139, in _start_heartbeats
    res = self.post(endpoint, json=payload).json()
AttributeError: 'NoneType' object has no attribute 'json'
100%|██████████| 261M/261M [00:13<00:00, 20.9MB/s]
WARNING ⚠️ using HUB training arguments, ignoring local training arguments.
Ultralytics YOLOv8.2.75 🚀 Python-3.10.12 torch-2.3.1+cu121 CUDA:0 (Tesla T4, 15102MiB)
engine/trainer: task=detect, mode=train, model=weights/epoch-100.pt, data=https://app.roboflow.com/ds/eL8DtSIPgc?key=YMzS4TZHn6, epochs=100, time=None, patience=100, batch=9, imgsz=640, save=True, save_period=-1, cache=None, device=[0], workers=8, project=None, name=train, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=weights/epoch-100.pt, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.0, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=0.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/detect/train
Downloading https://app.roboflow.com/ds/eL8DtSIPgc to 'eL8DtSIPgc'...
100%|██████████| 580M/580M [00:12<00:00, 48.9MB/s]
Unzipping eL8DtSIPgc to /content/datasets/eL8DtSIPgc...: 100%|██████████| 75540/75540 [00:12<00:00, 5948.42file/s]
Downloading https://ultralytics.com/assets/Arial.ttf to '/root/.config/Ultralytics/Arial.ttf'...
100%|██████████| 755k/755k [00:00<00:00, 19.5MB/s]
TensorBoard: Start with 'tensorboard --logdir runs/detect/train', view at http://localhost:6006/

                   from  n    params  module                                       arguments
  0                  -1  1      2320  ultralytics.nn.modules.conv.Conv             [3, 80, 3, 2]
  1                  -1  1    115520  ultralytics.nn.modules.conv.Conv             [80, 160, 3, 2]
  2                  -1  3    436800  ultralytics.nn.modules.block.C2f             [160, 160, 3, True]
  3                  -1  1    461440  ultralytics.nn.modules.conv.Conv             [160, 320, 3, 2]
  4                  -1  6   3281920  ultralytics.nn.modules.block.C2f             [320, 320, 6, True]
  5                  -1  1   1844480  ultralytics.nn.modules.conv.Conv             [320, 640, 3, 2]
  6                  -1  6  13117440  ultralytics.nn.modules.block.C2f             [640, 640, 6, True]
  7                  -1  1   3687680  ultralytics.nn.modules.conv.Conv             [640, 640, 3, 2]
  8                  -1  3   6969600  ultralytics.nn.modules.block.C2f             [640, 640, 3, True]
  9                  -1  1   1025920  ultralytics.nn.modules.block.SPPF            [640, 640, 5]
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 12                  -1  3   7379200  ultralytics.nn.modules.block.C2f             [1280, 640, 3]
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 15                  -1  3   1948800  ultralytics.nn.modules.block.C2f             [960, 320, 3]
 16                  -1  1    922240  ultralytics.nn.modules.conv.Conv             [320, 320, 3, 2]
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 18                  -1  3   7174400  ultralytics.nn.modules.block.C2f             [960, 640, 3]
 19                  -1  1   3687680  ultralytics.nn.modules.conv.Conv             [640, 640, 3, 2]
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]
 21                  -1  3   7379200  ultralytics.nn.modules.block.C2f             [1280, 640, 3]
 22        [15, 18, 21]  1   8726635  ultralytics.nn.modules.head.Detect           [9, [320, 640, 640]]
Model summary: 365 layers, 68,161,275 parameters, 68,161,259 gradients, 258.2 GFLOPs

Transferred 595/595 items from pretrained weights
Freezing layer 'model.22.dfl.conv.weight'
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
Downloading https://github.com/ultralytics/assets/releases/download/v8.2.0/yolov8n.pt to 'yolov8n.pt'...
100%|██████████| 6.25M/6.25M [00:00<00:00, 109MB/s]
AMP: checks passed ✅
train: Scanning /content/datasets/eL8DtSIPgc/train/labels... 33048 images, 17997 backgrounds, 0 corrupt: 100%|██████████| 33048/33048 [00:12<00:00, 2646.43it/s]
train: New cache created: /content/datasets/eL8DtSIPgc/train/labels.cache
albumentations: Blur(p=0.01, blur_limit=(3, 7)), MedianBlur(p=0.01, blur_limit=(3, 7)), ToGray(p=0.01), CLAHE(p=0.01, clip_limit=(1, 4.0), tile_grid_size=(8, 8))
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
val: Scanning /content/datasets/eL8DtSIPgc/valid/labels... 3141 images, 1750 backgrounds, 0 corrupt: 100%|██████████| 3141/3141 [00:01<00:00, 1967.34it/s]
val: New cache created: /content/datasets/eL8DtSIPgc/valid/labels.cache
Plotting labels to runs/detect/train/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: SGD(lr=0.01, momentum=0.9) with parameter groups 97 weight(decay=0.0), 104 weight(decay=0.0004921875), 103 bias(decay=0.0) Resuming training weights/epoch-100.pt from epoch 102 to 100 total epochs DetectionModel( (model): Sequential( (0): Conv( (conv): Conv2d(3, 80, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (1): Conv( (conv): Conv2d(80, 160, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (2): C2f( (cv1): Conv( (conv): Conv2d(160, 160, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(400, 160, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): Conv( (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (3): Conv( (conv): Conv2d(160, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (4): C2f( (cv1): Conv( (conv): Conv2d(320, 320, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( 
(conv): Conv2d(1280, 320, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-5): 6 x Bottleneck( (cv1): Conv( (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (5): Conv( (conv): Conv2d(320, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (6): C2f( (cv1): Conv( (conv): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(2560, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-5): 6 x Bottleneck( (cv1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (7): Conv( (conv): Conv2d(640, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (8): 
C2f( (cv1): Conv( (conv): Conv2d(640, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(1600, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (9): SPPF( (cv1): Conv( (conv): Conv2d(640, 320, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(1280, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): MaxPool2d(kernel_size=5, stride=1, padding=2, dilation=1, ceil_mode=False) ) (10): Upsample(scale_factor=2.0, mode='nearest') (11): Concat() (12): C2f( (cv1): Conv( (conv): Conv2d(1280, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(1600, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), 
bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (13): Upsample(scale_factor=2.0, mode='nearest') (14): Concat() (15): C2f( (cv1): Conv( (conv): Conv2d(960, 320, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(800, 320, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): Conv( (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(160, 160, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(160, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (16): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (17): Concat() (18): C2f( (cv1): Conv( (conv): Conv2d(960, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(1600, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): 
Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (19): Conv( (conv): Conv2d(640, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (20): Concat() (21): C2f( (cv1): Conv( (conv): Conv2d(1280, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(1600, 640, kernel_size=(1, 1), stride=(1, 1), bias=False) (bn): BatchNorm2d(640, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (m): ModuleList( (0-2): 3 x Bottleneck( (cv1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (cv2): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) ) ) ) (22): Detect( (cv2): ModuleList( (0): Sequential( (0): Conv( (conv): Conv2d(320, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (1): Conv( (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, 
track_running_stats=True) (act): SiLU(inplace=True) ) (2): Conv2d(80, 64, kernel_size=(1, 1), stride=(1, 1)) ) (1-2): 2 x Sequential( (0): Conv( (conv): Conv2d(640, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (1): Conv( (conv): Conv2d(80, 80, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(80, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (2): Conv2d(80, 64, kernel_size=(1, 1), stride=(1, 1)) ) ) (cv3): ModuleList( (0): Sequential( (0): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (2): Conv2d(320, 9, kernel_size=(1, 1), stride=(1, 1)) ) (1-2): 2 x Sequential( (0): Conv( (conv): Conv2d(640, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (1): Conv( (conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (bn): BatchNorm2d(320, eps=0.001, momentum=0.03, affine=True, track_running_stats=True) (act): SiLU(inplace=True) ) (2): Conv2d(320, 9, kernel_size=(1, 1), stride=(1, 1)) ) ) (dfl): DFL( (conv): Conv2d(16, 1, kernel_size=(1, 1), stride=(1, 1), bias=False) ) ) ) ) has been trained for 100 epochs. Fine-tuning for 100 more epochs.
TensorBoard: model graph visualization added ✅
Image sizes 640 train, 640 val
Using 2 dataloader workers
Logging results to runs/detect/train
Starting training for 200 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
    102/200      8.62G     0.4382     0.3942     0.9903          3        640:   1%|          | 21/3672 [00:15<41:57,  1.45it/s]
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
    102/200      8.62G     0.4382     0.3942     0.9903          3        640:   1%|          | 21/3672 [00:16<47:13,  1.29it/s]

Longhuiberkeley commented 1 month ago

I am also having a similar issue with the yolov8l or yolov8x models.

glenn-jocher commented 1 month ago

Hello!

It seems like you're encountering a similar issue with the yolov8l or yolov8x models. Here are a few steps you can try to resolve this:

  1. Update Packages: Ensure you are using the latest version of the Ultralytics package. You can update it using:

    pip install -U ultralytics
  2. Resume Training: When resuming training, make sure to specify the correct checkpoint and desired number of epochs. For example:

    model.train(data='your_data.yaml', epochs=1, resume=True)
  3. Check Arguments: If the training defaults to 200 epochs, it might be due to HUB-specific arguments overriding your local settings. Ensure you're not using conflicting parameters.

  4. Logs and Errors: Review any error messages or logs for additional clues. The error related to 'NoneType' object has no attribute 'json' might indicate a network or server issue. Retrying the operation could help.
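
The numbers in the log posted earlier ("from epoch 102 to 100 total epochs", "Fine-tuning for 100 more epochs", "Starting training for 200 epochs") are consistent with simple arithmetic, sketched below. This is only a reading of that log, not a confirmed description of the trainer: it assumes the checkpoint records 100 completed epochs, that the HUB-supplied `epochs=100` replaces the local value, and that the configured count is added on top when the checkpoint is already past it. `resumed_epoch_plan` is a hypothetical name.

```python
def resumed_epoch_plan(ckpt_epoch: int, configured_epochs: int) -> dict:
    """Reproduce the epoch numbers seen in the log, under the assumptions stated above."""
    start_epoch = ckpt_epoch + 1  # assumed index of the next epoch to run
    total = configured_epochs
    if configured_epochs < start_epoch:
        # Assumed behaviour: a checkpoint already past the configured total is
        # treated as fine-tuning for `configured_epochs` additional epochs.
        total = configured_epochs + ckpt_epoch
    return {"first_displayed_epoch": start_epoch + 1, "total_epochs": total}


# Values from the log: checkpoint trained for 100 epochs, HUB arguments say epochs=100.
plan = resumed_epoch_plan(ckpt_epoch=100, configured_epochs=100)
print(plan)  # {'first_displayed_epoch': 102, 'total_epochs': 200}
```

Under this reading, passing `epochs=1` locally would have no effect as long as the HUB arguments take precedence, which matches what was reported above.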

If the problem persists, please ensure it's reproducible with the latest versions and feel free to provide more details here. The community and the Ultralytics team are always here to help! 😊

If you have any more questions, feel free to ask!