huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/
Other
842 stars 175 forks source link

A question about the fully train step of CARS for the Cutsom ClassificationDataset. Thanks! #110

Closed YTW0518 closed 3 years ago

YTW0518 commented 3 years ago

When I use the classification data set defined by myself, there is no problem in the NAS step, but in the fully train step, there is a problem of feature size mismatch, as follows: (the classification image size is 256×256×3, the num_classes is 21, the batch_size in train is 2) Looking forward to your reply, thank you very very much!!!

INFO Clean worker folder /home/wyt/tasks/0512.004010.920/workers/nas. INFO ------------------------------------------------ INFO Step: fully_train INFO ------------------------------------------------ INFO init TrainPipeStep... INFO TrainPipeStep started... INFO Model was created. ERROR Failed to run pipeline. ERROR Traceback (most recent call last): File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 69, in run PipeStep().do() File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/train_pipe_step.py", line 50, in do self._train_multi_models(records) File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/train_pipe_step.py", line 121, in _train_multi_models self._train_single_model(record.desc, record.worker_id, weights_file) File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/train_pipe_step.py", line 95, in _train_single_model self._do_single_fully_train(trainer) File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/train_pipe_step.py", line 114, in _do_single_fully_train self._train_single_gpu_model(trainer) File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/pipeline/train_pipe_step.py", line 99, in _train_single_gpu_model self.master.run(trainer, evaluator) File "/home/wyt/.local/lib/python3.7/site-packages/vega/core/scheduler/local_master.py", line 47, in run worker.train_process() File "/home/wyt/.local/lib/python3.7/site-packages/zeus/trainer/trainer_base.py", line 133, in train_process self._train_loop() File "/home/wyt/.local/lib/python3.7/site-packages/zeus/trainer/trainer_base.py", line 308, in _train_loop self._train_epoch() File "/home/wyt/.local/lib/python3.7/site-packages/zeus/trainer/trainer_torch.py", line 102, in _train_epoch train_batch_output = self.train_step(batch) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/trainer/trainer_torch.py", line 155, in _default_train_step output = self.model(input) File "/home/wyt/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in call result = self.forward(*input, kwargs) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/modules/operators/functions/pytorch_fn.py", line 128, in forward return self.call(inputs, *args, *kwargs) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/networks/super_network.py", line 95, in call logits_aux = self.auxiliary_head(s1) File "/home/wyt/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in call result = self.forward(input, kwargs) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/modules/operators/functions/pytorch_fn.py", line 128, in forward return self.call(inputs, *args, *kwargs) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/modules/operators/functions/pytorch_fn.py", line 106, in call output = model(output) File "/home/wyt/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in call result = self.forward(input, **kwargs) File "/home/wyt/.local/lib/python3.7/site-packages/zeus/modules/operators/functions/pytorch_fn.py", line 395, in forward out = super().forward(x) File "/home/wyt/.local/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward return F.linear(input, self.weight, self.bias) File "/home/wyt/.local/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear ret = torch.addmm(bias, input, weight.t()) RuntimeError: size mismatch, m1: [2 x 277248], m2: [768 x 21] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:290

zhangjiajin commented 3 years ago

@YTW0518

Hi, please modify the template and resize the image, as follows:

nas:
    trainer:
        type: Trainer
        darts_template_file: "{default_darts_imagenet_template}"

fully_train:
    dataset:
        ref: nas.dataset
        common:
            train_portion: 1.0
        train:
            batch_size: 96
            shuffle: True
            transforms:
                -   type: Resize
                    size: [256, 256]
                -   type: RandomCrop
                    size: [224, 224]
                -   type: RandomHorizontalFlip
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768
        val:
            batch_size: 96
            shuffle: False
            transforms:
                -   type: Resize
                    size: [224, 224]
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768
        test:
            batch_size: 96
            shuffle: False
            transforms:
                -   type: Resize
                    size: [224, 224]
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768
YTW0518 commented 3 years ago

@YTW0518

Hi, please modify the template and resize the image, as follows:

nas:
    trainer:
        type: Trainer
        darts_template_file: "{default_darts_imagenet_template}"

fully_train:
    dataset:
        ref: nas.dataset
        common:
            train_portion: 1.0
        train:
            batch_size: 96
            shuffle: True
            transforms:
                -   type: Resize
                    size: [256, 256]
                -   type: RandomCrop
                    size: [224, 224]
                -   type: RandomHorizontalFlip
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768
        val:
            batch_size: 96
            shuffle: False
            transforms:
                -   type: Resize
                    size: [224, 224]
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768
        test:
            batch_size: 96
            shuffle: False
            transforms:
                -   type: Resize
                    size: [224, 224]
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.49139968
                        - 0.48215827
                        - 0.44653124
                    std:
                        - 0.24703233
                        - 0.24348505
                        - 0.26158768

@zhangjiajin Thanks for your reply, I have modified it according to your suggestion. The problem of feature size mismatch was solved, but a new problem appeared, as follows. Looking forward to your reply, thank you very very much!!!

2021-05-13 19:45:55.335 INFO Clean worker folder /home/wyt/tasks/0513.192508.098/workers/nas. 2021-05-13 19:45:55.339 INFO ------------------------------------------------ 2021-05-13 19:45:55.339 INFO Step: fully_train 2021-05-13 19:45:55.339 INFO ------------------------------------------------ 2021-05-13 19:45:55.360 INFO init TrainPipeStep... 2021-05-13 19:45:55.360 INFO TrainPipeStep started... 2021-05-13 19:45:55.535 ERROR Failed to create instance:<class 'zeus.networks.super_network.DartsNetwork'> 2021-05-13 19:45:55.536 ERROR Failed to get model, model_desc={'modules': ['super_network'], 'super_network': {'type': 'DartsNetwork', 'input_size': 224, 'init_channels': 18, 'num_classes': 21, 'auxiliary': True, 'aux_size': 7, 'auxiliary_layer': 9, 'drop_path_prob': 0.2, 'search': False, 'stem': {'type': 'PreOneStem', 'init_channels': 16, 'stem_multi': 3}, 'head': {'type': 'LinearClassificationHead'}, 'cells': {'modules': ['PreTwoStem', 'normal', 'normal', 'normal', 'normal', 'reduce', 'normal', 'normal', 'normal', 'normal', 'reduce', 'normal', 'normal', 'normal', 'normal'], 'normal': {'type': 'NormalCell', 'steps': 4, 'genotype': [['sep_conv_3x3', 2, 0], ['sep_conv_5x5', 2, 1], ['avg_pool_3x3', 3, 1], ['max_pool_3x3', 3, 2], ['dil_conv_5x5', 4, 1], ['max_pool_3x3', 4, 0], ['sep_conv_3x3', 5, 1], ['dil_conv_5x5', 5, 2]], 'concat': [2, 3, 4, 5]}, 'reduce': {'type': 'ReduceCell', 'steps': 4, 'genotype': [['dil_conv_3x3', 2, 0], ['dil_conv_3x3', 2, 1], ['max_pool_3x3', 3, 1], ['skip_connect', 3, 2], ['max_pool_3x3', 4, 1], ['max_pool_3x3', 4, 0], ['dil_conv_3x3', 5, 0], ['dil_conv_3x3', 5, 3]], 'concat': [2, 3, 4, 5]}}}}, msg='NoneType' object does not support item assignment

In addition, I update the .yml profile and template as follows: (1) cars.yml

    backend: pytorch  # pytorch

pipeline: [nas, fully_train]

nas:
    pipe_step:
        type: SearchPipeStep

    dataset:
        type: ClassificationDataset
        common:
            data_path: /home/wyt/dataset/out
            train_portion: 0.5
            num_workers: 8
            drop_last: False
        train:
            shuffle: True
            batch_size: 4
            transforms:
                -   type: Resize
                    size: [256, 256]
                -   type: RandomCrop
                    size: [224, 224]
                -   type: RandomHorizontalFlip
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.4842
                        - 0.4901
                        - 0.4505
                    std:
                        - 0.1734
                        - 0.1635
                        - 0.1554

        test:
            shuffle: False
            batch_size: 256

    search_algorithm:
        type: CARSAlgorithm
        policy:
            num_individual: 2
            start_ga_epoch: 2
            ga_interval: 2
            select_method: uniform
            warmup: 2

    search_space:
        type: SearchSpace
        modules: ['super_network']
        super_network:
            type: CARSDartsNetwork
            stem:
                type: PreOneStem
                init_channels: 8
                stem_multi: 3
            head:
                type: LinearClassificationHead
            init_channels: 8
            num_classes: 21
            auxiliary: False
            search: True
            cells:
                modules: [
                    'normal', 'normal', 'reduce',
                    'normal', 'normal', 'reduce',
                    'normal', 'normal'
                ]
                normal:
                    type: NormalCell
                    steps: 4
                    genotype:
                      [
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 2, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 2, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 3 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 3 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 4 ],
                      ]
                    concat: [2, 3, 4, 5]
                reduce:
                    type: ReduceCell
                    steps: 4
                    genotype:
                      [
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 2, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 2, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 3, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 4, 3 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 0 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 1 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 2 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 3 ],
                      [ ['none', 'max_pool_3x3', 'avg_pool_3x3', 'skip_connect', 'sep_conv_3x3', 'sep_conv_5x5', 'dil_conv_3x3', 'dil_conv_5x5'], 5, 4 ],
                      ]
                    concat: [2, 3, 4, 5]

    trainer:
        type: Trainer
        darts_template_file: "{default_darts_imagenet_template}"
        callbacks: CARSTrainerCallback
        epochs: 4
        optimizer:
            type: SGD
            params:
                lr: 0.025
                momentum: 0.9
                weight_decay: !!float 3e-4
        lr_scheduler:
            type: CosineAnnealingLR
            params:
                T_max: 4
                eta_min: 0.001
        grad_clip: 5.0
        seed: 10
        unrolled: True
        loss:
            type: CrossEntropyLoss
            params:
                sparse: True

fully_train:
    pipe_step:
        type: TrainPipeStep
        models_folder: "{local_base_path}/output/nas/"
    trainer:
        ref: nas.trainer
        epochs: 600
        lr_scheduler:
            type: CosineAnnealingLR
            params:
                T_max: 600.0
                eta_min: 0
        loss:
            type: MixAuxiliaryLoss
            params:
                loss_base:
                    type: CrossEntropyLoss
                aux_weight: 0.4
        seed: 100
        drop_path_prob: 0.2
    evaluator:
        type: Evaluator
        host_evaluator:
            type: HostEvaluator
            metric:
                type: accuracy
    dataset:
        ref: nas.dataset
        common:
            train_portion: 1.0
        train:
            batch_size: 2
            shuffle: True
            transforms:
                -   type: Resize
                    size: [256, 256]
                -   type: RandomCrop
                    size: [224, 224]
                -   type: RandomHorizontalFlip
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.4842
                        - 0.4901
                        - 0.4505
                    std:
                        - 0.1734
                        - 0.1635
                        - 0.1554

        test:
            batch_size: 36
            shuffle: False
            transforms:
                -   type: Resize
                    size: [224, 224]
                -   type: ToTensor
                -   type: Normalize
                    mean:
                        - 0.4842
                        - 0.4901
                        - 0.4505
                    std:
                        - 0.1734
                        - 0.1635
                        - 0.1554

(2) darts_imagenet.json


    "modules": [
        "super_network"
    ],
    "super_network": {
        "type": "DartsNetwork",
        "input_size": 224,
        "init_channels": 48,
        "num_classes": 21,
        "auxiliary": true,
        "aux_size": 7,
        "auxiliary_layer": 9,
        "drop_path_prob": 0.2,
        "search": false,
        "stem": {
            "type": "PreOneStem",
            "init_channels": 16,
            "stem_multi": 3
        },
        "head": {
            "type": "LinearClassificationHead"
        },
        "cells": {
            "modules": [
                "PreTwoStem",
                "normal",
                "normal",
                "normal",
                "normal",
                "reduce",
                "normal",
                "normal",
                "normal",
                "normal",
                "reduce",
                "normal",
                "normal",
                "normal",
                "normal"
            ],
            "normal": {
                "type": "NormalCell",
                "steps": 4,
                "genotype": [
                    [
                        "skip_connect",
                        2,
                        0
                    ],
                    [
                        "skip_connect",
                        2,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        3,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        3,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        4,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        4,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        5,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        5,
                        1
                    ]
                ],
                "concat": [
                    2,
                    3,
                    4,
                    5
                ]
            },
            "reduce": {
                "type": "ReduceCell",
                "steps": 4,
                "genotype": [
                    [
                        "sep_conv_3x3",
                        2,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        2,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        3,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        3,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        4,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        4,
                        1
                    ],
                    [
                        "sep_conv_3x3",
                        5,
                        0
                    ],
                    [
                        "sep_conv_3x3",
                        5,
                        1
                    ]
                ],
                "concat": [
                    2,
                    3,
                    4,
                    5
                ]
            }
        }
    }
}```
zhangjiajin commented 3 years ago

@YTW0518 We found a bug about this issue. Please download the latest code, compile and install it, and then try again.

YTW0518 commented 3 years ago

@zhangjiajin Thank you very much for your timely and effective reply. My problems have been solved very well. Thank you!

zhangjiajin commented 3 years ago

:)