jbohnslav / deepethogram


Sequence training error #134

Open · kylethieringer opened this issue 1 year ago

kylethieringer commented 1 year ago

When training a new sequence model, I am running into an error where the ModelCheckpoint callback searches for 'val/f1_class_mean' in the returned metrics but cannot find it. If I open the metrics file externally, no f1_class_mean dataset has been saved either.

When the error occurs, the currently running epoch is abandoned and the run moves on to the next one rather than finishing it.

Any help with this would be greatly appreciated! Thanks

/home/kyle/deg/lib/python3.10/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:  61%|███████████████████████████████████████████████████████████▏                                     | 839/1376 [00:48<00:31, 17.14it/s, loss=188, v_num=0]/home/kyle/deg/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:644: UserWarning: ModelCheckpoint(monitor='val/f1_class_mean') not found in the returned metrics: ['train_loss', 'train/loss', 'train/fps', 'train/lr', 'train/data_loss', 'train/reg_loss', 'train/accuracy_overall', 'train/f1_overall', 'train/f1_class_mean', 'train/f1_class_mean_nobg', 'train/auroc_class_mean', 'train/mAP_class_mean', 'train/auroc_overall', 'train/mAP_overall']. HINT: Did you call self.log('val/f1_class_mean', value) in the LightningModule?
  warning_cache.warn(m)
/home/kyle/deg/lib/python3.10/site-packages/pytorch_lightning/trainer/data_loading.py:132: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 1:  13%|████████████▉                                                                                    | 183/1376 [00:10<01:06, 17.95it/s, loss=152, v_num=0]^C/home/kyle/deg/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:688: UserWarning: Detected KeyboardInterrupt, attempting graceful shutdown...
  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")
Epoch 1:  13%|████████████▉  
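
For reference, the HINT in that warning points at what Lightning expects: ModelCheckpoint(monitor='val/f1_class_mean') can only work if that exact key gets logged from a validation hook. Below is a minimal sketch of the pattern, not deepethogram's actual module; the class, network, and metric (a plain accuracy standing in for the per-class F1 mean) are placeholders:

    import torch
    from torch import nn
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    class ToySequenceModule(pl.LightningModule):
        """Placeholder module showing where the monitored metric gets logged."""
        def __init__(self):
            super().__init__()
            self.net = nn.Linear(16, 3)

        def forward(self, x):
            return self.net(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            loss = nn.functional.cross_entropy(self(x), y)
            self.log('train/loss', loss)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            preds = self(x).argmax(dim=1)
            # toy stand-in for the per-class F1 mean; the key must match `monitor` exactly
            f1_class_mean = (preds == y).float().mean()
            self.log('val/f1_class_mean', f1_class_mean, on_epoch=True)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters())

    # ModelCheckpoint only sees metrics whose keys were logged via self.log above
    checkpoint_callback = ModelCheckpoint(monitor='val/f1_class_mean', mode='max')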
kylethieringer commented 1 year ago

Upon a fresh reinstall, I encountered an IndexError at line 582 of datasets.py: it tries to load the labels using a range of indices spanning the length of the video, but because of Python's zero-based indexing the last index is out of bounds. (Screenshot of the error attached.)
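
A toy reproduction of that off-by-one (shapes are made up; in datasets.py the label array is indexed along axis 1 as self.label[:, label_indices]):

    import numpy as np

    label = np.zeros((5, 100), dtype=np.int64)  # (n_behaviors, n_frames); sizes are illustrative
    label_indices = list(range(96, 101))        # a window whose last index equals n_frames
    label[:, label_indices]                     # IndexError: index 100 is out of bounds for axis 1 with size 100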

I added a non-permanent fix that allows the model to train, but there are caveats. The biggest is that it shifts the labels for the last chunk of data off by one frame. For the behavior I'm studying this shouldn't matter (one frame is within the expected labeling noise, and it is very rare for the behavior to occur at the very end of the video). I think this might be the result of some padding when the labels are loaded, but I'm not exactly sure where the source is. Here are the lines of code I added, in case they help anyone else:

In /deepethogram/data/datasets.py, lines 578-581:

    if not self.reduce:
        # new code start: if the window runs past the end of the label array,
        # shift every index back by one frame so the slice stays in bounds
        if label_indices[-1] >= self.label.shape[1]:
            label_indices = [i - 1 for i in label_indices]
        # new code end
        labels = self.label[:, label_indices].astype(np.int64)
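
A variant of the same lines that only clamps the out-of-range index, so just the final frame's label is repeated instead of the whole window shifting by one; this is only a sketch, not an official fix, and assumes label_indices can be treated as an integer array:

    if not self.reduce:
        # clamp any index past the end of the label array to the last valid frame
        label_indices = np.clip(label_indices, 0, self.label.shape[1] - 1)
        labels = self.label[:, label_indices].astype(np.int64)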