ocean-data-factory-sweden / kso

Notebooks to upload/download marine footage, connect to a citizen science project, train machine learning models and publish marine biological observations.
GNU General Public License v3.0

Issue in tutorial 5: ML training #294

Closed: Bergylta closed this issue 9 months ago

Bergylta commented 9 months ago

šŸ› Bug

To Reproduce (REQUIRED)

Model type: Object detection model
Select model: baseline-yolo5
Batch size: 8
Epochs: 80
H/W: 128x128

Input:

mlp.train_yolov5(
    exp_name.value,
    weights.artifact_path,
    project,
    epochs=epochs.value,
    batch_size=batch_size.value,
    img_size=(img_h.value, img_w.value),
)

Output:

train: weights=/mimer/NOBACKUP/groups/snic2021-6-9/tmp_dir/KSO_model_training_roughanemone_and_cushionstar/yolov5m.pt, cfg=, data=/mimer/NOBACKUP/groups/snic2021-6-9/tmp_dir/KSO_Model_training_frames_1/Koster_Seafloor_Obs_10:45:07.yaml, hyp=/mimer/NOBACKUP/groups/snic2021-6-9/tmp_dir/KSO_Model_training_frames_1/hyp.yaml, epochs=80, batch_size=1, imgsz=(128, 128), rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=Project(Project_name='Koster_Seafloor_Obs', Zooniverse_number=9747, db_path='/tmp/db/koster_lab.db', server='SNIC', bucket='None', key='None', csv_folder='/mimer/NOBACKUP/groups/snic2021-6-9/db_starter/csv_Koster_Seafloor_Obs/', movie_folder='/mimer/NOBACKUP/groups/snic2021-6-9/project_movies/movies_Koster/', photo_folder='None', ml_folder='None'), name=Rough_and_cushion_5_emil, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, entity=koster, upload_dataset=True, bbox_interval=-1, artifact_alias=latest, cache_images=True
error: cannot open /usr/src/app/kso/.git/modules/yolov5/FETCH_HEAD: Read-only file system
Command 'git fetch origin' returned non-zero exit status 255.
YOLOv5 šŸš€ 2023-9-15 Python-3.8.10 torch-2.0.1+cu117 CUDA:0 (NVIDIA A40, 45626MiB)

hyperparameters: anchor_t=4.0, box=0.05, cls=0.5, cls_pw=1.0, copy_paste=0.0, degrees=0.0, fl_gamma=0.0, fliplr=0.5, flipud=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, iou_t=0.2, lr0=0.01, lrf=0.1, mixup=0.0, momentum=0.937, mosaic=1.0, obj=1.0, obj_pw=1.0, perspective=0.0, scale=0.5, shear=0.0, translate=0.1, warmup_bias_lr=0.1, warmup_epochs=3.0, warmup_momentum=0.8, weight_decay=0.0005
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLOv5 šŸš€ in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLOv5 šŸš€ runs in Comet
TensorBoard: Start with 'tensorboard --logdir Project(Project_name='Koster_Seafloor_Obs', Zooniverse_number=9747, db_path='/tmp/db/koster_lab.db', server='SNIC', bucket='None', key='None', csv_folder='/mimer/NOBACKUP/groups/snic2021-6-9/db_starter/csv_Koster_Seafloor_Obs/', movie_folder='/mimer/NOBACKUP/groups/snic2021-6-9/project_movies/movies_Koster/', photo_folder='None', ml_folder='None')', view at http://localhost:6006/
Tracking run with wandb version 0.13.2
Run data is saved locally in /mimer/NOBACKUP/groups/snic2021-6-9/wandb/run-20231010_114235-fq4cedpx
Syncing run [Rough_and_cushion_5_emil](https://wandb.ai/koster/%27%2C%20photo_folder%3D%27None%27%2C%20ml_folder%3D%27None%27%29/runs/fq4cedpx) to [Weights & Biases](https://wandb.ai/koster/%27%2C%20photo_folder%3D%27None%27%2C%20ml_folder%3D%27None%27%29) ([docs](https://wandb.me/run))
Overriding model.yaml nc=80 with nc=1

                 from  n    params  module                                  arguments                     
  0                -1  1      5280  models.common.Focus                     [3, 48, 3]                    
  1                -1  1     41664  models.common.Conv                      [48, 96, 3, 2]                
  2                -1  2     65280  models.common.C3                        [96, 96, 2]                   
  3                -1  1    166272  models.common.Conv                      [96, 192, 3, 2]               
  4                -1  6    629760  models.common.C3                        [192, 192, 6]                 
  5                -1  1    664320  models.common.Conv                      [192, 384, 3, 2]              
  6                -1  6   2512896  models.common.C3                        [384, 384, 6]                 
  7                -1  1   2655744  models.common.Conv                      [384, 768, 3, 2]              
  8                -1  1   1476864  models.common.SPP                       [768, 768, [5, 9, 13]]        
  9                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 10                -1  1    295680  models.common.Conv                      [768, 384, 1, 1]              
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 12           [-1, 6]  1         0  models.common.Concat                    [1]                           
 13                -1  2   1182720  models.common.C3                        [768, 384, 2, False]          
 14                -1  1     74112  models.common.Conv                      [384, 192, 1, 1]              
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']          
 16           [-1, 4]  1         0  models.common.Concat                    [1]                           
 17                -1  2    296448  models.common.C3                        [384, 192, 2, False]          
 18                -1  1    332160  models.common.Conv                      [192, 192, 3, 2]              
 19          [-1, 14]  1         0  models.common.Concat                    [1]                           
 20                -1  2   1035264  models.common.C3                        [384, 384, 2, False]          
 21                -1  1   1327872  models.common.Conv                      [384, 384, 3, 2]              
 22          [-1, 10]  1         0  models.common.Concat                    [1]                           
 23                -1  2   4134912  models.common.C3                        [768, 768, 2, False]          
 24      [17, 20, 23]  1     24246  models.yolo.Detect                      [1, [[10, 13, 16, 30, 33, 23], [30, 61, 62, 45, 59, 119], [116, 90, 156, 198, 373, 326]], [192, 384, 768]]
Model summary: 309 layers, 21056406 parameters, 21056406 gradients

Transferred 499/505 items from /mimer/NOBACKUP/groups/snic2021-6-9/tmp_dir/KSO_model_training_roughanemone_and_cushionstar/yolov5m.pt
AMP: checks passed āœ…
optimizer: SGD(lr=0.01) with parameter groups 83 weight(decay=0.0), 86 weight(decay=0.0005), 86 bias
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 mlp.train_yolov5(
      2     exp_name.value,
      3     weights.artifact_path,
      4     project,
      5     epochs=epochs.value,
      6     batch_size=batch_size.value,
      7     img_size=(img_h.value, img_w.value),
      8 )

File /usr/src/app/kso-dev/kso_utils/project.py:1255, in MLProjectProcessor.train_yolov5(self, exp_name, weights, project, epochs, batch_size, img_size)
   1251 def train_yolov5(
   1252     self, exp_name, weights, project, epochs=50, batch_size=16, img_size=[640, 640]
   1253 ):
   1254     if self.model_type == 1:
-> 1255         self.modules["train"].run(
   1256             entity=self.team_name,
   1257             data=self.data_path,
   1258             hyp=self.hyp_path,
   1259             weights=weights,
   1260             project=project,
   1261             name=exp_name,
   1262             imgsz=img_size,
   1263             batch_size=int(batch_size),
   1264             epochs=epochs,
   1265             single_cls=False,
   1266             cache_images=True,
   1267             upload_dataset=True,
   1268         )
   1269     elif self.model_type == 2:
   1270         self.modules["train"].run(
   1271             entity=self.team_name,
   1272             data=self.data_path,
   (...)
   1278             epochs=epochs,
   1279         )

File /usr/src/app/kso/yolov5/train.py:627, in run(**kwargs)
    625 for k, v in kwargs.items():
    626     setattr(opt, k, v)
--> 627 main(opt)
    628 return opt

File /usr/src/app/kso/yolov5/train.py:527, in main(opt, callbacks)
    525 # Train
    526 if not opt.evolve:
--> 527     train(opt.hyp, opt, device, callbacks)
    529 # Evolve hyperparameters (optional)
    530 else:
    531     # Hyperparameter evolution metadata (mutation scale 0-1, lower_limit, upper_limit)
    532     meta = {
    533         'lr0': (1, 1e-5, 1e-1),  # initial learning rate (SGD=1E-2, Adam=1E-3)
    534         'lrf': (1, 0.01, 1.0),  # final OneCycleLR learning rate (lr0 * lrf)
   (...)
    560         'mixup': (1, 0.0, 1.0),  # image mixup (probability)
    561         'copy_paste': (1, 0.0, 1.0)}  # segment copy-paste (probability)

File /usr/src/app/kso/yolov5/train.py:187, in train(hyp, opt, device, callbacks)
    184     LOGGER.info('Using SyncBatchNorm()')
    186 # Trainloader
--> 187 train_loader, dataset = create_dataloader(train_path,
    188                                           imgsz,
    189                                           batch_size // WORLD_SIZE,
    190                                           gs,
    191                                           single_cls,
    192                                           hyp=hyp,
    193                                           augment=True,
    194                                           cache=None if opt.cache == 'val' else opt.cache,
    195                                           rect=opt.rect,
    196                                           rank=LOCAL_RANK,
    197                                           workers=workers,
    198                                           image_weights=opt.image_weights,
    199                                           quad=opt.quad,
    200                                           prefix=colorstr('train: '),
    201                                           shuffle=True)
    202 labels = np.concatenate(dataset.labels, 0)
    203 mlc = int(labels[:, 0].max())  # max label class

File /usr/src/app/kso/yolov5/utils/dataloaders.py:123, in create_dataloader(path, imgsz, batch_size, stride, single_cls, hyp, augment, cache, pad, rect, rank, workers, image_weights, quad, prefix, shuffle)
    121     shuffle = False
    122 with torch_distributed_zero_first(rank):  # init dataset *.cache only once if DDP
--> 123     dataset = LoadImagesAndLabels(
    124         path,
    125         imgsz,
    126         batch_size,
    127         augment=augment,  # augmentation
    128         hyp=hyp,  # hyperparameters
    129         rect=rect,  # rectangular batches
    130         cache_images=cache,
    131         single_cls=single_cls,
    132         stride=int(stride),
    133         pad=pad,
    134         image_weights=image_weights,
    135         prefix=prefix)
    137 batch_size = min(batch_size, len(dataset))
    138 nd = torch.cuda.device_count()  # number of CUDA devices

File /usr/src/app/kso/yolov5/utils/dataloaders.py:456, in LoadImagesAndLabels.__init__(self, path, img_size, batch_size, augment, hyp, rect, image_weights, cache_images, single_cls, stride, pad, min_items, prefix)
    454 self.rect = False if image_weights else rect
    455 self.mosaic = self.augment and not self.rect  # load 4 images at a time into a mosaic (only during training)
--> 456 self.mosaic_border = [-img_size // 2, -img_size // 2]
    457 self.stride = stride
    458 self.path = path

TypeError: bad operand type for unary -: 'list'

Expected behavior

Additional context

(5 screenshots attached)

jannesgg commented 9 months ago

The issue is here:

img_size=(img_h.value, img_w.value)

Changing this to img_size=img_h.value should solve the problem. I will fix this on my side as well.
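For reference, the corrected notebook cell would look roughly like this (all names are the tutorial's own widgets; passing a single integer matters because YOLOv5's dataloader computes mosaic_border = [-img_size // 2, -img_size // 2], and the unary minus fails on a tuple):

mlp.train_yolov5(
    exp_name.value,
    weights.artifact_path,
    project,
    epochs=epochs.value,
    batch_size=batch_size.value,
    img_size=img_h.value,  # single int (e.g. 128), not a (h, w) tuple
)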

Bergylta commented 9 months ago

A new error in the same location:

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In[13], line 1
----> 1 mlp.train_yolov5(
      2     exp_name.value,
      3     weights.artifact_path,
      4     project,
      5     epochs=epochs.value,
      6     batch_size=batch_size.value,
      7     img_size=(img_h.value),
      8 )

File /usr/src/app/kso-dev/kso_utils/project.py:1255, in MLProjectProcessor.train_yolov5(self, exp_name, weights, project, epochs, batch_size, img_size)
   1251 def train_yolov5(
   1252     self, exp_name, weights, project, epochs=50, batch_size=16, img_size=[640, 640]
   1253 ):
   1254     if self.model_type == 1:
-> 1255         self.modules["train"].run(
   1256             entity=self.team_name,
   1257             data=self.data_path,
   1258             hyp=self.hyp_path,
   1259             weights=weights,
   1260             project=project,
   1261             name=exp_name,
   1262             imgsz=img_size,
   1263             batch_size=int(batch_size),
   1264             epochs=epochs,
   1265             single_cls=False,
   1266             cache_images=True,
   1267             upload_dataset=True,
   1268         )
   1269     elif self.model_type == 2:
   1270         self.modules["train"].run(
   1271             entity=self.team_name,
   1272             data=self.data_path,
   (...)
   1278             epochs=epochs,
   1279         )

File /usr/src/app/kso/yolov5/train.py:627, in run(**kwargs)
    625 for k, v in kwargs.items():
    626     setattr(opt, k, v)
--> 627 main(opt)
    628 return opt

File /usr/src/app/kso/yolov5/train.py:527, in main(opt, callbacks)
    525 # Train
    526 if not opt.evolve:
--> 527     train(opt.hyp, opt, device, callbacks)
    529 # Evolve hyperparameters (optional)
    530 else:
    531     # Hyperparameter evolution metadata (mutation scale 0-1, lower_limit, upper_limit)
    532     meta = {
    533         'lr0': (1, 1e-5, 1e-1),  # initial learning rate (SGD=1E-2, Adam=1E-3)
    534         'lrf': (1, 0.01, 1.0),  # final OneCycleLR learning rate (lr0 * lrf)
   (...)
    560         'mixup': (1, 0.0, 1.0),  # image mixup (probability)
    561         'copy_paste': (1, 0.0, 1.0)}  # segment copy-paste (probability)

File /usr/src/app/kso/yolov5/train.py:414, in train(hyp, opt, device, callbacks)
    408 if f is best:
    409     LOGGER.info(f'\nValidating {f}...')
    410     results, _, _ = validate.run(
    411         data_dict,
    412         batch_size=batch_size // WORLD_SIZE * 2,
    413         imgsz=imgsz,
--> 414         model=attempt_load(f, device).half(),
    415         iou_thres=0.65 if is_coco else 0.60,  # best pycocotools at iou 0.65
    416         single_cls=single_cls,
    417         dataloader=val_loader,
    418         save_dir=save_dir,
    419         save_json=is_coco,
    420         verbose=True,
    421         plots=plots,
    422         callbacks=callbacks,
    423         compute_loss=compute_loss)  # val best model with plots
    424     if is_coco:
    425         callbacks.run('on_fit_epoch_end', list(mloss) + list(results) + lr, epoch, best_fitness, fi)

File /usr/src/app/kso/yolov5/models/experimental.py:79, in attempt_load(weights, device, inplace, fuse)
     77 model = Ensemble()
     78 for w in weights if isinstance(weights, list) else [weights]:
---> 79     ckpt = torch.load(attempt_download(w), map_location='cpu')  # load
     80     ckpt = (ckpt.get('ema') or ckpt['model']).to(device).float()  # FP32 model
     82     # Model compatibility updates

File /usr/local/lib/python3.8/dist-packages/torch/serialization.py:791, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    788 if 'encoding' not in pickle_load_args.keys():
    789     pickle_load_args['encoding'] = 'utf-8'
--> 791 with _open_file_like(f, 'rb') as opened_file:
    792     if _is_zipfile(opened_file):
    793         # The zipfile reader is going to advance the current file position.
    794         # If we want to actually tail call to torch.jit.load, we need to
    795         # reset back to the original position.
    796         orig_position = opened_file.tell()

File /usr/local/lib/python3.8/dist-packages/torch/serialization.py:271, in _open_file_like(name_or_buffer, mode)
    269 def _open_file_like(name_or_buffer, mode):
    270     if _is_path(name_or_buffer):
--> 271         return _open_file(name_or_buffer, mode)
    272     else:
    273         if 'w' in mode:

File /usr/local/lib/python3.8/dist-packages/torch/serialization.py:252, in _open_file.__init__(self, name, mode)
    251 def __init__(self, name, mode):
--> 252     super().__init__(open(name, mode))

FileNotFoundError: [Errno 2] No such file or directory: 'Project(Project_name=Koster_Seafloor_Obs, Zooniverse_number=9747, db_path=/tmp/db/koster_lab.db, server=SNIC, bucket=None, key=None, csv_folder=/mimer/NOBACKUP/groups/snic2021-6-9/db_starter/csv_Koster_Seafloor_Obs/, movie_folder=/mimer/NOBACKUP/groups/snic2021-6-9/project_movies/movies_Koster/, photo_folder=None, ml_folder=None)/KSO_rough_cushion_5_80_emil/weights/best.pt'
jannesgg commented 9 months ago

mlp.train_yolov5(
    exp_name.value,
    weights.artifact_path,
    project.Project_name,
    epochs=epochs.value,
    batch_size=batch_size.value,
    img_size=(img_h.value),
)
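For context, the FileNotFoundError above suggests why this helps: YOLOv5's train.py builds its output directory from the project argument (roughly save_dir = project/name), so passing the full Project object embeds its repr in the path, i.e. 'Project(Project_name=Koster_Seafloor_Obs, ...)/KSO_rough_cushion_5_80_emil/weights/best.pt'. Passing the plain project name string should give train.py a normal directory to save and reload best.pt from.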