huawei-noah / vega

AutoML tools chain
http://www.noahlab.com.hk/opensource/vega/

Bad results of PBA #256

Open Violonur-PavelBI opened 2 years ago

Violonur-PavelBI commented 2 years ago

Thanks for your impressive work. After running PBA, I got very low results:

```
INFO:root:  15: {'flops': 0.556660224, 'params': 11173.962, 'accuracy': 0.4115953947368421, 'accuracy_top1': 0.4115953947368421, 'accuracy_top5': 0.8470394736842105, 'latency': 6.360976584255695}
```

Can you help me solve this problem?

GPU:

```
NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6
```

`pip list` output:

aiohttp 3.8.1 aiosignal 1.2.0 albumentations 1.1.0 alembic 1.7.7 anyio 3.5.0 argon2-cffi 21.3.0 argon2-cffi-bindings 21.2.0 async-timeout 4.0.2 asynctest 0.13.0 attrs 21.4.0 Babel 2.9.1 backcall 0.2.0 beautifulsoup4 4.10.0 bleach 4.1.0 bokeh 2.4.2 brotlipy 0.7.0 certifi 2021.10.8 cffi 1.14.6 chardet 4.0.0 charset-normalizer 2.0.12 click 8.0.4 cloudpickle 2.0.0 conda 4.10.3 conda-build 3.21.5 conda-package-handling 1.7.3 cryptography 3.4.8 cycler 0.11.0 dask 2022.2.0 databricks-cli 0.16.4 debugpy 1.5.1 decorator 5.1.0 defusedxml 0.7.1 dill 0.3.5.1 distributed 2022.2.0 dnspython 2.1.0 docker 5.0.3 entrypoints 0.4 filelock 3.0.12 Flask 2.0.3 fonttools 4.30.0 frozenlist 1.3.0 fsspec 2022.2.0 future 0.18.2 gitdb 4.0.9 GitPython 3.1.27 glob2 0.7 googledrivedownloader 0.4 greenlet 1.1.2 gunicorn 20.1.0 HeapDict 1.0.1 idna 2.10 imageio 2.16.1 importlib-metadata 4.11.3 importlib-resources 5.4.0 ipykernel 6.9.2 ipython 7.27.0 ipython-genutils 0.2.0 itsdangerous 2.1.1 jedi 0.18.0 Jinja2 3.0.3 joblib 1.1.0 json5 0.9.6 jsonschema 4.4.0 jupyter-client 7.1.2 jupyter-core 4.9.2 jupyter-server 1.15.6 jupyter-server-proxy 3.2.1 jupyterlab 3.3.2 jupyterlab-pygments 0.1.2 jupyterlab-server 2.11.1 kiwisolver 1.4.0 libarchive-c 2.9 locket 0.2.1 Mako 1.2.0 MarkupSafe 2.0.1 matplotlib 3.5.1 matplotlib-inline 0.1.2 mistune 0.8.4 mkl-fft 1.3.1 mkl-random 1.2.2 mkl-service 2.4.0 mlflow 1.24.0 msgpack 1.0.3 multidict 6.0.2 nbclassic 0.3.7 nbclient 0.5.13 nbconvert 6.4.4 nbformat 5.2.0 nest-asyncio 1.5.4 networkx 2.6.3 noah-vega 1.8.4 notebook 6.4.10 notebook-shim 0.1.0 numpy 1.21.2 ofa 0.1.0.post202111231444 olefile 0.46 onnx 1.11.0 opencv-contrib-python 4.5.5.64 opencv-python 4.5.5.64 opencv-python-headless 4.5.5.64 packaging 21.3 pandas 1.3.5 pandocfilters 1.5.0 parso 0.8.2 partd 1.2.0 pexpect 4.8.0 pickleshare 0.7.5 Pillow 8.4.0 pip 22.1.2 pkginfo 1.7.1 prometheus-client 0.13.1 prometheus-flask-exporter 0.19.0 prompt-toolkit 3.0.20 protobuf 3.19.4 psutil 5.8.0 ptyprocess 0.7.0 pycosat 0.6.3 pycparser 2.20 Pygments 2.10.0 pyOpenSSL 20.0.1 pyparsing 3.0.7 pyrsistent 0.18.1 PySocks 1.7.1 python-dateutil 2.8.2 python-etcd 0.4.5 pytz 2021.3 PyWavelets 1.3.0 PyYAML 5.4.1 pyzmq 22.3.0 qudida 0.0.4 querystring-parser 1.2.4 requests 2.25.1 ruamel-yaml-conda 0.15.100 scikit-image 0.19.2 scikit-learn 1.0.2 scipy 1.7.3 seaborn 0.11.2 Send2Trash 1.8.0 setuptools 58.0.4 simpervisor 0.4 six 1.16.0 smmap 5.0.0 sniffio 1.2.0 sortedcontainers 2.4.0 soupsieve 2.2.1 SQLAlchemy 1.4.32 sqlparse 0.4.2 tabulate 0.8.9 tblib 1.7.0 tensorboardX 2.5 terminado 0.13.3 testpath 0.6.0 thop 0.1.0.post2206102148 threadpoolctl 3.1.0 tifffile 2021.11.2 toolz 0.11.2 torch 1.4.0 torchtext 0.11.0 torchvision 0.5.0 tornado 6.1 tqdm 4.61.2 traitlets 5.1.0 typing-extensions 3.10.0.2 urllib3 1.26.6 wcwidth 0.2.5 webencodings 0.5.1 websocket-client 1.3.1 Werkzeug 2.0.3 wheel 0.36.2 yarl 1.7.2 zict 2.1.0 zipp 3.7.0

zhangjiajin commented 2 years ago

@Violonur-PavelBI

Please provide the version of Python and PyTorch.

Violonur-PavelBI commented 2 years ago

Python 3.6.9, PyTorch 1.3.0a0+24ae9b5

zhangjiajin commented 2 years ago

@Violonur-PavelBI

We found out why the accuracy is so low: only a very small portion of the data is used in the fully_train phase. Please train with the following configuration:

```yaml
fully_train:
    pipe_step:
        type: TrainPipeStep
    dataset:
        ref: pba.dataset
        common:
            train_portion: 1.0      # Use the full training data.
        train:
            shuffle: True           # Shuffle during training.
...
```
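For reference, `train_portion` controls the fraction of the training set the trainer actually sees. Here is a minimal sketch of how such a portion split can be applied in plain PyTorch (illustrative only, not Vega's actual dataset code):

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

def make_train_loader(dataset, train_portion=1.0, batch_size=256):
    # Keep a random `train_portion` fraction of the training set.
    n_keep = int(len(dataset) * train_portion)
    indices = torch.randperm(len(dataset))[:n_keep].tolist()
    # SubsetRandomSampler re-shuffles the kept indices every epoch,
    # matching `shuffle: True` above.
    return DataLoader(dataset, batch_size=batch_size,
                      sampler=SubsetRandomSampler(indices))
```

With `train_portion: 1.0` the loader covers the whole training set, which is what the fully_train phase needs.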
Violonur-PavelBI commented 2 years ago

@zhangjiajin I relaunched it with the correction. Do I understand correctly that such low accuracy is normal during the PBA stage itself?

zhangjiajin commented 2 years ago

@Violonur-PavelBI

Yes. The PBA phase only needs relative comparisons between candidates, so it uses less data, and the accuracy is correspondingly low.
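To make that concrete: during search, PBA only has to rank candidate augmentation schedules against each other, so low absolute numbers are fine as long as the ordering is meaningful. A simplified sketch (not Vega's code; the names are illustrative):

```python
def select_best_candidate(val_accuracy):
    """Pick the schedule to keep, given worker_id -> accuracy measured on
    the reduced (train_portion < 1.0) data. Only the relative order matters
    here; the winner is retrained on the full data in fully_train."""
    return max(val_accuracy, key=val_accuracy.get)

# Example: all accuracies are low, but worker 7 still wins the comparison.
print(select_best_candidate({3: 0.26, 7: 0.31, 12: 0.29}))  # -> 7
```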

Violonur-PavelBI commented 2 years ago

@zhangjiajin It didn't help:

```
INFO:root:flops: 0.5578890240000001 , params:11173.962
INFO:root:Finished the unified trainer successfully.
INFO:root:Update Success. step_name=pba, worker_id=15
INFO:root:Best values: [{'worker_id': 6, 'performance': {'flops': 0.5578890240000001, 'params': 11173.962, 'accuracy': 0.27411400139664804, 'accuracy_top1': 0.27411400139664804, 'accuracy_top5': 0.739830656424581}}]
INFO:root:Clean worker folder /workspace/proj/vega/vega/tasks/0628.084138.423/workers/pba.
INFO:root:------------------------------------------------
INFO:root:  Step: fully_train
INFO:root:------------------------------------------------
INFO:vega.core.pipeline.train_pipe_step:init TrainPipeStep...
INFO:vega.core.pipeline.train_pipe_step:TrainPipeStep started...
INFO:root:Model was created.
INFO:root:load model weights from file, weights file=/workspace/proj/vega/vega/tasks/0628.084138.423/output/pba/model_6.pth
INFO:root:flops: 0.5578890240000001 , params:11173.962
INFO:root:worker id [6], epoch [1/400], train step [ 0/195], loss [ 1.344, 1.344], lr [ 0.1000000], time pre batch [0.519s] , total mean time per batch [0.519s]
...
INFO:root:worker id [6], epoch [400/400], current valid perfs [accuracy: 0.132, accuracy_top1: 0.132, accuracy_top5: 0.505], best valid perfs [accuracy: 0.401, accuracy_top1: 0.401, accuracy_top5: 0.835]
INFO:root:flops: 0.5578890240000001 , params:11173.962
INFO:root:Finished the unified trainer successfully.
INFO:root:start evaluate process
INFO:root:Model was created.
INFO:root:load model weights from file, weights file=/workspace/proj/vega/vega/tasks/0628.084138.423/workers/fully_train/6/model_6.pth
INFO:root:step [1/39], valid metric [[[tensor(0.3945, device='cuda:0'), tensor(0.8242, device='cuda:0')]]]
INFO:root:step [11/39], valid metric [[[tensor(0.3945, device='cuda:0'), tensor(0.8398, device='cuda:0')]]]
INFO:root:step [21/39], valid metric [[[tensor(0.4102, device='cuda:0'), tensor(0.8555, device='cuda:0')]]]
INFO:root:step [31/39], valid metric [[[tensor(0.3984, device='cuda:0'), tensor(0.8086, device='cuda:0')]]]
INFO:root:evaluator latency [4.54831620445475]
INFO:root:evaluate performance: {'accuracy': 0.4011418269230769, 'accuracy_top1': 0.4011418269230769, 'accuracy_top5': 0.835136217948718, 'latency': 4.54831620445475}
INFO:root:finished host evaluation, id: 6, performance: {'accuracy': 0.4011418269230769, 'accuracy_top1': 0.4011418269230769, 'accuracy_top5': 0.835136217948718, 'latency': 4.54831620445475}
INFO:root:------------------------------------------------
INFO:root:  Pipeline end.
INFO:root:
INFO:root:  task id: 0628.084138.423
INFO:root:  output folder: /workspace/proj/vega/vega/tasks/0628.084138.423/output
INFO:root:
INFO:root:  running time:
INFO:root:    pba:          3:07:25 [2022-06-28 08:41:40.687438 - 2022-06-29 11:49:06.298678]
INFO:root:    fully_train:  3:30:13 [2022-06-29 11:49:06.388008 - 2022-06-29 15:19:19.973391]
INFO:root:
INFO:root:  result:
INFO:root:    6: {'flops': 0.5578890240000001, 'params': 11173.962, 'accuracy': 0.4011418269230769, 'accuracy_top1': 0.4011418269230769, 'accuracy_top5': 0.835136217948718, 'latency': 4.54831620445475}
INFO:root:------------------------------------------------
```

zhangjiajin commented 2 years ago

@Violonur-PavelBI

Copy that. I'll find the cause of the issue.

Violonur-PavelBI commented 2 years ago

Copy what exactly?

zhangjiajin commented 2 years ago

I mean that after modifying the configuration file, the accuracy is still not good, and I will find the cause of the issue. We've already found some clues: when the data augmentation method changes during training, the previous model may not be correctly loaded to continue training.
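To visualize the suspected failure mode: in PBA, the augmentation schedule changes between training rounds, and each round is supposed to resume from the previous round's checkpoint. A minimal sketch of the intended resume step (illustrative assumptions only; this is not Vega's actual trainer code):

```python
import os
import torch
from torchvision import transforms

def resume_with_new_policy(model, optimizer, ckpt_path, dataset, policy_transforms):
    # Restore weights and optimizer state from the previous round.
    # If this load is skipped or fails silently, every round restarts
    # from scratch, which would explain the low final accuracy.
    if os.path.exists(ckpt_path):
        state = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
    # Swap in the new augmentation schedule; the weights carry over unchanged.
    dataset.transform = transforms.Compose(policy_transforms)
    return model, optimizer
```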

Violonur-PavelBI commented 2 years ago

Thank you very much