mhamilton723 / STEGO

Unsupervised Semantic Segmentation by Distilling Feature Correspondences

Issues with training on custom dataset #31

Closed. nadavmisgav closed this issue 2 years ago.

nadavmisgav commented 2 years ago

Hello there,
I am trying to train on my custom dataset, which currently holds about 60 images for training and 60 for validation. I am using the following train_config.yml:

output_root: '../'
pytorch_data_dir: '/content/drive/MyDrive/custom_dataset/'
experiment_name: "exp1"
log_dir: "exp1"
azureml_logging: True
submitting_to_aml: False

# Loader params
num_workers: 1
max_steps: 10
batch_size: 16

num_neighbors: 7
dataset_name: "directory"

# Used if dataset_name is "directory"
dir_dataset_name: "train_data"
dir_dataset_n_classes: 5

has_labels: False
crop_type: "five"
crop_ratio: .5
res: 224
loader_crop_type: "center"

# Model Params
extra_clusters: 0
use_true_labels: False
use_recalibrator: False
model_type: "vit_small"
arch: "dino"
use_fit_model: False
dino_feat_type: "feat"
projection_type: "nonlinear"
#projection_type: linear
dino_patch_size: 8
granularity: 1
continuous: True
dim: 70
dropout: True
zero_clamp: True

lr: 5e-4
pretrained_weights: ~
use_salience: False
stabalize: False
stop_at_zero: True

# Feature Contrastive params
pointwise: True
feature_samples: 11
neg_samples: 5
aug_alignment_weight: 0.0

correspondence_weight: 1.0

# IAROA vit small 1/31/22
neg_inter_weight: 0.63
pos_inter_weight: 0.25
pos_intra_weight: 0.67
neg_inter_shift: 0.46
pos_inter_shift: 0.12
pos_intra_shift: 0.18

# Potsdam vit small 1/31/22
#neg_inter_weight: 0.63
#pos_inter_weight: 0.25
#pos_intra_weight: 0.67
#neg_inter_shift: 0.46
#pos_inter_shift: 0.02
#pos_intra_shift: 0.08

# Cocostuff27 vit small 1/31/22
#neg_inter_weight: 0.63
#pos_inter_weight: 0.25
#pos_intra_weight: 0.67
#neg_inter_shift: 0.66
#pos_inter_shift: 0.02
#pos_intra_shift: 0.08

## Cocostuff27 10/3 vit_base

#neg_inter_weight: 0.1538476246415498
#pos_inter_weight: 1
#pos_intra_weight: 0.1
#
#neg_inter_shift: 1
#pos_inter_shift: 0.2
#pos_intra_shift: 0.12

## Cocostuff27 10/3 vit_small
#neg_inter_weight: .63
#pos_inter_weight: .25
#pos_intra_weight: .67
#
#neg_inter_shift: .16
#pos_inter_shift: .02
#pos_intra_shift: .08

## Cocostuff27 10/3 moco
#neg_inter_weight: .63
#pos_inter_weight: .25
#pos_intra_weight: .67
#
#neg_inter_shift: .26
#pos_inter_shift: .36
#pos_intra_shift: .32

#pos_inter_shift: .12
#pos_intra_shift: .18

## Cocostuff27
#neg_inter_weight: .72
#pos_inter_weight: .80
#pos_intra_weight: .29
#
#neg_inter_shift: .86
#pos_inter_shift: .04
#pos_intra_shift: .34

# Cityscapes 10/3

#neg_inter_weight: 0.9058762625226623
#pos_inter_weight: 0.577453483136995
#pos_intra_weight: 1
#
#neg_inter_shift: 0.31361241889448443
#pos_inter_shift: 0.1754346515479633
#pos_intra_shift: 0.45828472207

# Cityscapes
#neg_inter_weight: .72
#pos_inter_weight: .18
#pos_intra_weight: .46
#
#neg_inter_shift: .25
#pos_inter_shift: .20
#pos_intra_shift: .25

rec_weight: 0.0
repulsion_weight: 0.0

# CRF Params
crf_weight: 0.0
alpha: .5
beta: .15
gamma: .05
w1: 10.0
w2: 3.0
shift: 0.00
crf_samples: 1000
color_space: "rgb"

reset_probe_steps: ~

# Logging params
n_images: 5
scalar_log_freq: 10
checkpoint_freq: 2
val_freq: 2
hist_freq: 2

hydra:
  run:
    dir: "."
  output_subdir: ~
  #job_logging: "disabled"
  #hydra_logging: "disabled"
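
For reference, dataset_name: "directory" points the loader at a fixed folder layout under pytorch_data_dir/dir_dataset_name. My understanding (worth double-checking against DirectoryDataset in src/data.py) is that it expects something like imgs/train, imgs/val, and, when labels are available, labels/train and labels/val. A quick sanity check using the paths from my config; the exact subfolder names are an assumption, and with has_labels: False the labels folders may be empty or absent:

from pathlib import Path

# pytorch_data_dir / dir_dataset_name from the config above
root = Path("/content/drive/MyDrive/custom_dataset/train_data")

# The expected layout is an assumption based on my reading of DirectoryDataset.
for sub in ["imgs/train", "imgs/val", "labels/train", "labels/val"]:
    path = root / sub
    if path.is_dir():
        print(f"{path}: {len(list(path.glob('*')))} files")
    else:
        print(f"{path}: missing")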

I reduced val_freq and some of the other frequencies to match the small dataset, but while running train_segmentation.py I ran into the following:

Global seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2022-06-24 14:27:21,025][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2022-06-24 14:27:21,025][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Missing logger folder: ../logs/exp1/directory_exp1_date_Jun24_14-27-16/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name                     | Type                       | Params
-------------------------------------------------------------------------
0  | net                      | DinoFeaturizer             | 21.9 M
1  | train_cluster_probe      | ClusterLookup              | 350   
2  | cluster_probe            | ClusterLookup              | 350   
3  | linear_probe             | Conv2d                     | 355   
4  | decoder                  | Conv2d                     | 27.3 K
5  | cluster_metrics          | UnsupervisedMetrics        | 0     
6  | linear_metrics           | UnsupervisedMetrics        | 0     
7  | test_cluster_metrics     | UnsupervisedMetrics        | 0     
8  | test_linear_metrics      | UnsupervisedMetrics        | 0     
9  | linear_probe_loss_fn     | CrossEntropyLoss           | 0     
10 | crf_loss_fn              | ContrastiveCRFLoss         | 0     
11 | contrastive_corr_loss_fn | ContrastiveCorrelationLoss | 0     
-------------------------------------------------------------------------
230 K     Trainable params
21.7 M    Non-trainable params
21.9 M    Total params
87.601    Total estimated model params size (MB)
Epoch 0:   0% 0/13 [00:00<?, ?it/s] Epoch 0, global step 3: 'test/cluster/mIoU' was not in top 2
Epoch 0, global step 6: 'test/cluster/mIoU' was not in top 2
Epoch 0, global step 9: 'test/cluster/mIoU' was not in top 2
Epoch 0, global step 12: 'test/cluster/mIoU' was not in top 2
Epoch 0:   0% 0/13 [00:03<?, ?it/s, v_num=0]

and no model checkpoint is saved (the run completes). I am running this training on Google Colab.

Any suggestions on what the problem might be? (I am aware that the dataset is too small for a strong model.)
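
For what it's worth, the "'test/cluster/mIoU' was not in top 2" lines come from PyTorch Lightning's ModelCheckpoint callback, which here monitors test/cluster/mIoU and keeps only the top 2 values. Below is a minimal sketch of a fallback callback that saves on a fixed schedule regardless of the monitored metric; the dirpath is hypothetical, it assumes PyTorch Lightning 1.3+, and it is not STEGO's own checkpointing code:

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Save a checkpoint every few training steps regardless of any monitored
# metric, so even a short run on a tiny dataset leaves a usable checkpoint.
periodic_checkpoint = ModelCheckpoint(
    dirpath="../checkpoints/exp1",  # hypothetical output location
    every_n_train_steps=5,
    save_top_k=-1,                  # keep all periodic checkpoints
)

# trainer = Trainer(max_steps=10, callbacks=[periodic_checkpoint])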

Benben0506 commented 2 years ago

Have you solved this problem? I am running into the same one.

nadavmisgav commented 2 years ago

I am now able to train on my custom dataset. I changed multiple things and somewhat lost track, but I think the main bits were:

  1. Removing the usage of cfg.num_workers and disabling all parallel computing.
  2. Reducing batch sizes and frequencies.

I am including my diff.txt of the changes in src/* if you want to check it out; a rough sketch of what point 1 boils down to follows below.
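
The sketch uses a dummy tensor dataset in place of the real one and is not the actual diff, just an illustration of single-process loading with a smaller batch:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a ~60-image dataset of 224x224 RGB crops.
images = torch.randn(60, 3, 224, 224)
dataset = TensorDataset(images)

# num_workers=0 loads data in the main process instead of spawning worker
# subprocesses, and a small batch size suits the tiny dataset.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([4, 3, 224, 224])
    break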

tanveer6715 commented 1 year ago

Hi, how much cluster mIoU did you get on your custom dataset?