kubeedge / ianvs

Distributed Synergy AI Benchmarking
https://ianvs.readthedocs.io
Apache License 2.0

Guide for running the example of robot/lifelong_learning_bench/semantic-segmentation #106

Open FuryMartin opened 2 weeks ago

FuryMartin commented 2 weeks ago

Introduction or background of this discussion:

Guide for running the example of robot/lifelong_learning_bench/semantic-segmentation

Contents of this discussion:

I have been trying to run examples/robot/lifelong_learning_bench/semantic-segmentation to learn how to use Ianvs.

Running this example was not easy, and I ran into a series of difficulties. Below I record the steps I took and the solutions to the problems I encountered, in the hope that they help others interested in Ianvs.

For the problems discovered along the way, I also provide some suggestions that I hope the community can address.

Ianvs Preparation

I created a new conda environment to run this project on an Ubuntu 22.04 server. Following the guide (#step-1-ianvs-preparation), I chose Python 3.9 for the environment:

conda create -n ianvs-reproduce python=3.9
conda activate ianvs-reproduce

Then I installed Sedna following the instructions:

pip install ./examples/resources/third_party/*
pip install -r requirements.txt

Then I installed ianvs by executing python setup.py install.

Dataset Preparation

In Step 2, I needed to download the dataset. I got the dataset from @hsj576. It has the following structure:

├── 1280x760
│   ├── gtFine
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── rgb
│   │   ├── test
│   │   ├── train
│   │   └── val
│   └── viz
│       ├── test
│       ├── train
│       └── val
├── 2048x1024
│   ├── gtFine
│   │   ├── test
│   │   ├── train
│   │   └── val
│   ├── rgb
│   │   ├── test
│   │   ├── train
│   │   └── val
│   └── viz
│       ├── test
│       ├── train
│       └── val
└── 640x480
    ├── gtFine
    │   ├── test
    │   ├── train
    │   └── val
    ├── json
    │   ├── test
    │   ├── train
    │   └── val
    ├── rgb
    │   ├── test
    │   ├── train
    │   └── val
    └── viz
        ├── test
        ├── train
        └── val

I also got training index files from @hsj576, which contain multiple path pairs as shown below:

rgb/train/20220420_front/00000.png gtFine/train/20220420_front/00000_TrainIds.png
rgb/train/20220420_front/00001.png gtFine/train/20220420_front/00001_TrainIds.png
...

However, the README.md did not point out how the index files should be placed. After some trial and error, I found that all the files in the 2048x1024 folder need to be moved to the directory where the index files are located.
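A quick way to confirm the placement is to check that every path in an index file resolves next to it. The following is a minimal sketch, not part of Ianvs: the function name find_missing_pairs is hypothetical, and it assumes each line holds whitespace-separated paths that are relative to the directory containing the index file, as the trial and error above suggests.

```python
from pathlib import Path


def find_missing_pairs(index_file):
    """Return (line_number, path) tuples for index entries that do not exist.

    Assumption: each non-empty line holds an image path and a label path
    separated by whitespace, both relative to the index file's directory.
    """
    index_path = Path(index_file)
    root = index_path.parent
    missing = []
    with index_path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            for relative in line.split():
                if not (root / relative).is_file():
                    missing.append((line_number, relative))
    return missing
```

An empty result for both train-index.txt and test-index.txt would indicate that the rgb and gtFine folders sit where the index files expect them.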

Then, as the guide instructs, I needed to configure the dataset URLs in testenv.yml. There are two folders in ianvs/examples/robot/lifelong_learning_bench/. I edited semantic-segmentation/testenv/testenv.yml in the benchmark project, which looks like this:

testenv:
  # dataset configuration
  dataset:
    # the url address of train dataset index; string type;
    train_url: "/home/shijing.hu/ianvs/dataset/robot_dataset/train-index.txt"
    # the url address of test dataset index; string type;
    test_url: "/home/shijing.hu/ianvs/dataset/robot_dataset/test-index.txt"
  # model eval configuration of incremental learning;
  model_eval:
    # metric used for model evaluation
    model_metric:
      # metric name; string type;
      name: "accuracy"
      # the url address of python file
      url: "./examples/robot/lifelong_learning_bench/testenv/accuracy.py"
      mode: "no-inference"
    ...

I assumed that train_url and test_url were the fields I had to edit. Since the url ./examples/robot/lifelong_learning_bench/testenv/accuracy.py suggests that the root path for this file is ianvs/project/ianvs, and my dataset is in ianvs/project/datasets, I updated the configuration as follows:

testenv:
  # dataset configuration
  dataset:
    # the url address of train dataset index; string type;
    train_url: "../datasets/robot_dataset/train-index.txt"
    # the url address of test dataset index; string type;
    test_url: "../datasets/robot_dataset/test-index.txt"
  # model eval configuration of incremental learning;
  model_eval:
    # metric used for model evaluation
    model_metric:
      # metric name; string type;
      name: "accuracy"
      # the url address of python file
      url: "./examples/robot/lifelong_learning_bench/testenv/accuracy.py"
      mode: "no-inference"
    ...

There were multiple testenv files in testenv/ and I edited them all.
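Editing several files by hand is error-prone, so the same change can be scripted. This is only a sketch under assumptions: the helper names are hypothetical, it assumes the URLs are double-quoted on a single line as in the file shown above, and it keeps each existing index file name while replacing only the directory part. It edits text line by line rather than parsing YAML so that the comments survive.

```python
import re
from pathlib import Path

# Matches lines such as:  train_url: "/some/path/train-index.txt"
URL_LINE = re.compile(r'^(\s*(?:train_url|test_url):\s*)"([^"]*)"\s*$')


def rewrite_dataset_urls(text, dataset_dir):
    """Point train_url/test_url at dataset_dir, keeping the file names."""
    out = []
    for line in text.splitlines():
        match = URL_LINE.match(line)
        if match:
            prefix, old_url = match.groups()
            file_name = old_url.rsplit("/", 1)[-1]
            line = f'{prefix}"{dataset_dir}/{file_name}"'
        out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if text.endswith("\n") else "")


def rewrite_all(testenv_dir, dataset_dir):
    """Rewrite every .yaml/.yml file in testenv_dir; return changed names."""
    changed = []
    for path in sorted(Path(testenv_dir).glob("*.y*ml")):
        original = path.read_text(encoding="utf-8")
        updated = rewrite_dataset_urls(original, dataset_dir)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed.append(path.name)
    return changed
```

For example, rewrite_all("examples/robot/lifelong_learning_bench/semantic-segmentation/testenv", "../datasets/robot_dataset") would apply the edit shown above to each testenv file in that folder.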

Large Vision Model Preparation

Next, I needed to download the SAM package and model according to #step-2.5-large-vision-model-preparationoptional. This step went smoothly.

Then, I needed to install mmcv and mmdetection. Installing mmcv following the guide succeeded, but installing mmdetection failed, as shown below.

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:
  Running command python setup.py egg_info
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "./ianvs-reproduce/project/mmdetection/setup.py", line 11, in <module>
      import torch
  ModuleNotFoundError: No module named 'torch'

So I needed to install torch myself. As the guide didn't mention the torch version, I assumed I needed torch 2.0.0 with cu118, because the download link for mmcv in the guide indicates this version: https://download.openmmlab.com/mmcv/dist/cu118/torch2.0.0/mmcv-2.0.0-cp39-cp39-manylinux1_x86_64.whl.
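The version inference above can be written down as a small helper. This is a sketch only: parse_mmcv_wheel_url is a hypothetical name, and the URL layout (a CUDA tag followed by a torch version directory) is assumed from the single link in the guide rather than from any documented scheme.

```python
import re


def parse_mmcv_wheel_url(url):
    """Extract (cuda_tag, torch_version) from an mmcv wheel URL.

    Assumption: the URL contains a segment like /cu118/torch2.0.0/.
    """
    match = re.search(r"/(cu\d+|cpu)/torch(\d+(?:\.\d+)*)/", url)
    if match is None:
        raise ValueError(f"no CUDA/torch segment found in {url!r}")
    return match.group(1), match.group(2)
```

For the link in the guide this returns ("cu118", "2.0.0"), which can then be compared with torch.__version__ and torch.version.cuda in the installed environment.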

I installed torch with cu118 following the instructions from Previous PyTorch Versions | PyTorch.

pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118

As recommended in the guide, I downloaded the cache.pickle and pretrain_model.pth to the specified path and edited self.resume with the correct path.

Execution and Presentation

I used the code below to try running ianvs:

ianvs -f examples/robot/lifelong_learning_bench/semantic-segmentation/benchmarkingjob-simple.yaml

Then, I found some errors about packages:

  File "./ianvs-reproduce/project/ianvs/core/storymanager/visualization/visualization.py", line 20, in <module>
    from prettytable import from_csv
ModuleNotFoundError: No module named 'prettytable'

and

AttributeError: partially initialized module 'charset_normalizer' has no attribute 'md__mypyc' (most likely due to a circular import)

and

 File "./ianvs-reproduce/lib/python3.9/site-packages/sedna/algorithms/seen_task_learning/seen_task_learning.py", line 22, in <module>
    from sklearn import metrics as sk_metrics
ModuleNotFoundError: No module named 'sklearn'

and

A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.0 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

and

  File "./examples/robot/lifelong_learning_bench/testalgorithms/rfnet/RFNet/train.py", line 4, in <module>
    from tqdm import tqdm
ModuleNotFoundError: No module named 'tqdm'

  File "/home/***/miniconda3/envs/ianvs-reproduce/lib/python3.9/site-packages/torch/utils/tensorboard/__init__.py", line 1, in <module>
    import tensorboard
ModuleNotFoundError: No module named 'tensorboard'

  File "./examples/robot/lifelong_learning_bench/testalgorithms/rfnet/RFNet/eval.py", line 26, in <module>
    from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation
ModuleNotFoundError: No module named 'transformers'

I used the command below to install the missing packages and pin the conflicting versions:

pip install prettytable scikit-learn tqdm tensorboard transformers charset_normalizer==3.1.0 numpy==1.26.4
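To avoid discovering missing modules one traceback at a time, they can be checked up front. The sketch below uses only the standard library; the REQUIRED mapping is my own list built from the tracebacks above (import name to pip package name), not something Ianvs ships.

```python
import importlib.util

# Import name -> pip package name. Only sklearn differs among the
# modules that failed in the tracebacks above.
REQUIRED = {
    "prettytable": "prettytable",
    "sklearn": "scikit-learn",
    "tqdm": "tqdm",
    "tensorboard": "tensorboard",
    "transformers": "transformers",
}


def missing_packages(required):
    """Return pip package names whose import name cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

Calling missing_packages(REQUIRED) inside the conda environment lists what still has to be installed. It does not detect version conflicts such as the charset_normalizer and NumPy 2.0 errors, which still need explicit pins.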

When I reran the ianvs command, I got an error:

(ianvs-reproduce) **@server:~/data/OSSP/ianvs-reproduce/project/ianvs$ ianvs -f examples/robot/lifelong_learning_bench/semantic-segmentation/benchmarkingjob-simple.yaml                                                                                
Traceback (most recent call last):
  File "./ianvs-reproduce/project/ianvs/core/cmd/benchmarking.py", line 36, in main
    job = BenchmarkingJob(config[str.lower(BenchmarkingJob.__name__)])
  File "./ianvs-reproduce/project/ianvs/core/cmd/obj/benchmarkingjob.py", line 50, in __init__
    self._parse_config(config)
  File "./ianvs-reproduce/project/ianvs/core/cmd/obj/benchmarkingjob.py", line 103, in _parse_config
    self._parse_testenv_config(v)
  File "./ianvs-reproduce/project/ianvs/core/cmd/obj/benchmarkingjob.py", line 116, in _parse_testenv_config
    raise RuntimeError(f"not found testenv config file({config_file}) in local")
RuntimeError: not found testenv config file(./examples/robot/lifelong_learning_bench/testenv/testenv-robot.yaml) in local

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/**/miniconda3/envs/ianvs-reproduce/bin/ianvs", line 33, in <module>
    sys.exit(load_entry_point('ianvs==0.1.0', 'console_scripts', 'ianvs')())
  File "./ianvs-reproduce/project/ianvs/core/cmd/benchmarking.py", line 41, in main
    raise RuntimeError(f"benchmarkingjob runs failed, error: {err}.") from err
RuntimeError: benchmarkingjob runs failed, error: not found testenv config file(./examples/robot/lifelong_learning_bench/testenv/testenv-robot.yaml) in local.

It appears that there is a path issue. After examining the structure of this example, I realized that I can resolve it by moving all the files from ./examples/robot/lifelong_learning_bench/semantic-segmentation to ./examples/robot/lifelong_learning_bench.
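Before moving files around, it can help to list which relative paths in a config file fail to resolve from the directory where ianvs is launched. This is a minimal sketch with a hypothetical helper name; it assumes the paths are double-quoted and start with ./ or ../, and it scans the file as text rather than parsing YAML.

```python
import re
from pathlib import Path

# Double-quoted strings that start with ./ or ../
RELATIVE_PATH = re.compile(r'"(\.{1,2}/[^"]+)"')


def unresolved_paths(config_file, base_dir="."):
    """Return relative paths in config_file that do not exist under base_dir."""
    text = Path(config_file).read_text(encoding="utf-8")
    base = Path(base_dir)
    return [p for p in RELATIVE_PATH.findall(text) if not (base / p).exists()]
```

Output directories that the job creates at run time will also be reported as missing, so the result is a list of candidates to inspect rather than a list of errors.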

After making this change and running the command, I encountered new exceptions:

(ianvs-reproduce) $:~/data/OSSP/ianvs-reproduce/project/ianvs$ ianvs -f examples/robot/lifelong_learning_bench/benchmarkingjob-simple.yaml
un_classes:30
Upsample layer: in = 128, skip = 64, out = 128
Upsample layer: in = 128, skip = 128, out = 128
Upsample layer: in = 128, skip = 256, out = 128
128
Model loaded successfully!
Traceback (most recent call last):
  File "/home/**/ianvs-reproduce/project/ianvs/core/testcasecontroller/testcase/testcase.py", line 74, in run
    res, system_metric_info = paradigm.run()
  File "/home/**/ianvs-reproduce/project/ianvs/core/testcasecontroller/algorithm/paradigm/lifelong_learning/lifelong_learning.py", line 166, in run
    dataset_files = self._split_dataset(splitting_dataset_times=rounds)
  File "/home/**/ianvs-reproduce/project/ianvs/core/testcasecontroller/algorithm/paradigm/lifelong_learning/lifelong_learning.py", line 433, in _split_dataset
    output_dir=self.dataset_output_dir(),
  File "/home/**/ianvs-reproduce/project/ianvs/core/testcasecontroller/algorithm/paradigm/base.py", line 69, in dataset_output_dir
    os.makedirs(output_dir)
  File "/home/**/miniconda3/envs/ianvs-reproduce/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/**/miniconda3/envs/ianvs-reproduce/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/**/miniconda3/envs/ianvs-reproduce/lib/python3.9/os.py", line 215, in makedirs
    makedirs(head, exist_ok=exist_ok)
  [Previous line repeated 3 more times]
  File "/home/**/miniconda3/envs/ianvs-reproduce/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/ianvs'

This was also a path issue. I searched for /ianvs in the project folder and found that the workspace setting in the benchmarkingjob YAML files, including benchmarkingjob-simple.yaml, needed to be reconfigured.
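The PermissionError comes from os.makedirs trying to create a directory directly under the filesystem root. A workspace value can be checked before a run with a sketch like the following; can_create is a hypothetical helper, not an Ianvs function.

```python
import os


def can_create(directory):
    """Return True if directory is writable or could be created.

    Walks up to the nearest existing ancestor and checks that it is a
    directory the current user may write to.
    """
    path = os.path.abspath(directory)
    while not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            return False
        path = parent
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)
```

For an unprivileged user, a workspace under /ianvs would normally fail this check, while a path inside the home directory or the project folder would pass.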

In the next stage, I encountered more path problems, as shown below:

Traceback (most recent call last):
  File "/home/**/ianvs-reproduce/project/ianvs/core/cmd/benchmarking.py", line 37, in main
    job.run()
  File "/home/**/ianvs-reproduce/project/ianvs/core/cmd/obj/benchmarkingjob.py", line 93, in run
    succeed_testcases, test_results = self.testcase_controller.run_testcases(self.workspace)
  File "/home/**/ianvs-reproduce/project/ianvs/core/testcasecontroller/testcasecontroller.py", line 56, in run_testcases
    raise RuntimeError(f"testcase(id={testcase.id}) runs failed, error: {err}") from err
RuntimeError: testcase(id=e139c552-2c87-11ef-b834-b42e99a3b90d) runs failed, error: (paradigm=lifelonglearning) pipeline runs failed, error: [Errno 2] No such file or directory: '/home/hsj/ianvs/project/cache.pickle'
Traceback (most recent call last):
  File "/home/**/miniconda3/envs/ianvs-reproduce/bin/ianvs", line 33, in <module>
    sys.exit(load_entry_point('ianvs==0.1.0', 'console_scripts', 'ianvs')())
  File "/home/**/ianvs-reproduce/project/ianvs/core/cmd/benchmarking.py", line 41, in main
    raise RuntimeError(f"benchmarkingjob runs failed, error: {err}.") from err
RuntimeError: benchmarkingjob runs failed, error: testcase(id=46211dbc-2c88-11ef-a03f-b42e99a3b90d) runs failed, error: (paradigm=lifelonglearning) pipeline runs failed, error: [Errno 2] No such file or directory: '/home/hsj/ianvs/project/segment-anything/sam_vit_h_4b8939.pth'.
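Paths like these are hard-coded somewhere in the example, so it is quicker to find them all at once than to hit them one run at a time. The following sketch (hypothetical helper name, standard library only) scans Python and YAML files for strings that look like absolute paths under /home. The regular expression is a heuristic and may miss paths or report false positives.

```python
import re
from pathlib import Path

# Heuristic: /home/<user>/... up to the next quote or whitespace.
HOME_PATH = re.compile(r"/home/[\w.\-]+/[^\s'\"]*")


def find_hardcoded_home_paths(root, suffixes=(".py", ".yaml", ".yml")):
    """Return (relative_file, line_number, path) for each /home/... match."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for number, line in enumerate(text.splitlines(), start=1):
                for match in HOME_PATH.findall(line):
                    hits.append((str(path.relative_to(root)), number, match))
    return hits
```

Running it on examples/robot/lifelong_learning_bench should surface the cache.pickle and sam_vit_h_4b8939.pth references from the errors above, along with any other user-specific paths.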

After fixing these problems, I could run this project.

[2024-06-17 20:56:54,847] task_evaluation.py(69) [INFO] - front_semantic_segamentation_model scores: {'accuracy': 0.5691549465958629}
[2024-06-17 20:56:54,852] lifelong_learning.py(449) [INFO] - Task evaluation finishes.
[2024-06-17 20:56:54,852] lifelong_learning.py(452) [INFO] - upload kb index from index.pkl to ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/eval/0/index.pkl
[2024-06-17 20:56:54,852] lifelong_learning.py(208) [INFO] - train from round 0
[2024-06-17 20:56:54,853] lifelong_learning.py(209) [INFO] - test round 1
[2024-06-17 20:56:54,853] lifelong_learning.py(210) [INFO] - all scores: {'accuracy': 0.5691549465958629}
[2024-06-17 20:56:54,853] lifelong_learning.py(220) [INFO] - front_semantic_segamentation_model scores: {'accuracy': 0.5691549465958629}
[2024-06-17 20:56:54,853] lifelong_learning.py(443) [INFO] - Download kb index from ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/train/0/index.pkl to index.pkl
load model url:  ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/train/0/seen_task/front_semantic_segamentation_model.pth
:   0%|                                                                              | 0/4 [00:00<?, ?it/s][Save] save rfnet prediction:  ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/eval/0/front/00187.png_origin.png
:  25%|█████████████████▌                                                    | 1/4 [00:00<00:02,  1.37it/s][Save] save rfnet prediction:  ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/eval/0/front/00190.png_origin.png
:  50%|███████████████████████████████████                                   | 2/4 [00:01<00:01,  1.33it/s][Save] save rfnet prediction:  ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/eval/0/front/00192.png_origin.png
:  75%|████████████████████████████████████████████████████▌                 | 3/4 [00:02<00:00,  1.32it/s][Save] save rfnet prediction:  ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f74a8748-2ca8-11ef-82f0-4125e9124177/output/eval/0/front/00195.png_origin.png
: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.35it/s]
Found 4 test RGB images
Found 4 test disparity images
:   0%|                                                                              | 0/4 [00:00<?, ?it/s](1, 1024, 2048) (1, 1024, 2048)
:  25%|█████████████████▌                                                    | 1/4 [00:00<00:00,  6.76it/s](1, 1024, 2048) (1, 1024, 2048)
:  50%|███████████████████████████████████                                   | 2/4 [00:00<00:00,  6.77it/s](1, 1024, 2048) (1, 1024, 2048)
:  75%|████████████████████████████████████████████████████▌                 | 3/4 [00:00<00:00,  6.77it/s](1, 1024, 2048) (1, 1024, 2048)
: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  6.60it/s]
-----------Acc of each classes-----------
road         : 99.818096 %
sidewalk     : nan %
building     : 96.967190 %
wall         : nan %
fence        : nan %
pole         : 0.000000 %
traffic light: nan %
traffic sign : nan %
vegetation   : 98.160555 %
terrain      : nan %
sky          : nan %
person       : nan %
rider        : nan %
car          : nan %
truck        : nan %
bus          : nan %
train        : nan %
motorcycle   : nan %
bicycle      : nan %
stair        : 99.408024 %
curb         : nan %
ramp         : nan %
runway       : nan %
flowerbed    : nan %
door         : nan %
CCTV camera  : nan %
Manhole      : nan %
hydrant      : nan %
belt         : nan %
dustbin      : nan %
-----------IoU of each classes-----------
road         : 99.436234 %
sidewalk     : nan %
building     : 96.699620 %
wall         : nan %
fence        : nan %
pole         : 0.000000 %
traffic light: nan %
traffic sign : nan %
vegetation   : 86.071165 %
terrain      : nan %
sky          : 0.000000 %
person       : nan %
rider        : nan %
car          : nan %
truck        : nan %
bus          : nan %
train        : nan %
motorcycle   : nan %
bicycle      : nan %
stair        : 98.085159 %
curb         : nan %
ramp         : nan %
runway       : nan %
flowerbed    : nan %
door         : nan %
CCTV camera  : nan %
Manhole      : nan %
hydrant      : nan %
belt         : nan %
dustbin      : nan %
-----------FWIoU of each classes-----------
road         : 36.936448 %
sidewalk     : 29.667129 %
-----------freq of each classes-----------
road         : 37.145863 %
sidewalk     : 0.000000 %
building     : 30.679675 %
wall         : 0.000000 %
fence        : 0.000000 %
pole         : 0.065531 %
traffic light: 0.000000 %
traffic sign : 0.000000 %
vegetation   : 5.160104 %
terrain      : 0.000000 %
sky          : 0.000000 %
person       : 0.000000 %
rider        : 0.000000 %
car          : 0.000000 %
truck        : 0.000000 %
bus          : 0.000000 %
train        : 0.000000 %
motorcycle   : 0.000000 %
bicycle      : 0.000000 %
stair        : 26.948826 %
curb         : 0.000000 %
ramp         : 0.000000 %
runway       : 0.000000 %
flowerbed    : 0.000000 %
door         : 0.000000 %
CCTV camera  : 0.000000 %
Manhole      : 0.000000 %
hydrant      : 0.000000 %
belt         : 0.000000 %
dustbin      : 0.000000 %
CPA:0.7887077301139684, mIoU:0.633820295720513, fwIoU: 0.9747773749178817

...

However, there still seem to be some bugs. For example, rank.py contains this line:

https://github.com/kubeedge/ianvs/blob/7ea4f4af57114ce3179cd0c0773a4254c5999715/core/storymanager/rank/rank.py#L178

which can raise the exception below:

Traceback (most recent call last):
  File "/home/**/ianvs-reproduce/project/ianvs/core/cmd/benchmarking.py", line 37, in main
    job.run()
  File "/home/**/ianvs-reproduce/project/ianvs/core/cmd/obj/benchmarkingjob.py", line 96, in run
    self.rank.save(succeed_testcases, test_results, output_dir=self.workspace)
  File "/home/**/ianvs-reproduce/project/ianvs/core/storymanager/rank/rank.py", line 260, in save
    self._save_all()
  File "/home/**/ianvs-reproduce/project/ianvs/core/storymanager/rank/rank.py", line 178, in _save_all
    all_df.index = pd.np.arange(1, len(all_df) + 1)
AttributeError: module 'pandas' has no attribute 'np'

Finally, we could see the csv output after removing the prefix pd:

rank algorithm BWT MATRIX accuracy task_avg_acc samples_transfer_ratio FWT paradigm basemodel task_definition task_allocation unseen_sample_recognition basemodel-learning_rate basemodel-epochs task_definition-origins task_allocation-origins unseen_sample_recognition-threhold time url
1  -0.0020555379043278497  0.6102329717747815 0.6007481221422825 0.6024 -0.0022857764312219273            

However, the output still seems to have some problems like:

But in the end, we have accomplished the entire process of the example.

Advice

Overall, because of omissions in the documentation and hard-coded configuration in the code, running this project is not easy. To address this, I recommend:

hsj576 commented 2 weeks ago

Thanks for your suggestions! The missing data problem may be caused by the wrong version of pandas. You could run "pip install pandas==1.1.5" to install the correct version instead of removing the "pd" prefix in "all_df.index = pd.np.arange(1, len(all_df) + 1)".

FuryMartin commented 2 weeks ago

Thanks, it works. Now I can get the complete output:

rank algorithm accuracy task_avg_acc paradigm basemodel task_definition task_allocation unseen_sample_recognition basemodel-learning_rate basemodel-epochs task_definition-origins task_allocation-origins unseen_sample_recognition-threhold time url
1 sam_rfnet_lifelong_learning 0.6403945099540626 0.6142496824998978 lifelonglearning BaseModel TaskDefinitionByOrigin TaskAllocationByOrigin HardSampleMining 0.0001 1 ['front',,'garden'] ['front',,'garden'] 0.95 2024-06-18,09:19:25 ../sam-workspace/benchmarkingjob/sam_rfnet_lifelong_learning/f3ec718c-2d0d-11ef-82f0-4125e9124177

hsj576 commented 2 weeks ago

Good job!

MooreZheng commented 1 week ago

Good to see the detailed guide. It could be used to enrich the original one, and you might want to contribute a new pull request to https://github.com/kubeedge/ianvs/blob/main/examples/robot/lifelong_learning_bench/semantic-segmentation/README.md .