isl-org / lang-seg

Language-Driven Semantic Segmentation
MIT License

Questions about training and inference configuration #17

Closed chufengt closed 2 years ago

chufengt commented 2 years ago

Hi,

Thanks for open-sourcing such great work. I have some questions when using this code:

  1. Does the test_lseg.py script support multi-GPU inference? With a single GPU, inference on ade20k takes about 2~3 hours.
  2. I evaluated the provided demo_e200.ckpt on ade20k and got (pixAcc: 0.8078, mIoU: 0.3207). Is that expected? These values seem lower than the ones in the paper.
  3. I trained a model on ade20k (the same config as train.sh, backbone vit_l16_384) with 8*V100 and found that 240 epochs take ~90 hours. Is that reasonable? It seems much longer than what you reported in #7.
  4. What changes are needed to use this code on other datasets such as Cityscapes? The only difference I found is get_labels() in lseg_module.py. Have you evaluated the mIoU on Cityscapes?

Thanks in advance.

Boyiliee commented 2 years ago

Hi @chufengt,

Thanks for your interest in LSeg!

  1. Since single-GPU test time is acceptable, we currently don't try multi-GPU inference.
  2. There might be some misunderstanding: the demo model is only for qualitative trials on the fly. For the experiments and ablation studies we use different settings; please take a detailed look at Section 5.1 and the experimental-setup part of the paper.
  3. Same as 2: please strictly follow the settings in the paper. For the ablation studies, such as the results in Section 5.1, we train LSeg with DPT and a smaller ViT-B/32 backbone together with the CLIP ViT-B/32 text encoder on the ADE20K dataset. You can follow the training and testing instructions in the README; the primary change is to set --backbone clip_vitb32_384. You can check the details via this link and #13.
  4. You should add label files to https://github.com/isl-org/lang-seg/tree/main/label_files and change the get_labels function to control how your label file is processed (a rough sketch of such a change is shown below). For the quantitative results of the paper, we did not evaluate the mIoU on Cityscapes.
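
For illustration, here is a minimal sketch of what a modified get_labels could look like. It assumes a plain-text label file with one class name per line; the file name, format, and function signature are assumptions for this sketch, not the repository's exact implementation.

import os

# Minimal sketch (assumed file name/format, not the repo's exact code):
# reads label_files/<dataset>.txt with one class name per line.
def get_labels(dataset):
    path = os.path.join("label_files", "{}.txt".format(dataset))
    if not os.path.exists(path):
        raise FileNotFoundError("no label file found at {}".format(path))
    with open(path) as f:
        # skip blank lines, keep class names in file order
        return [line.strip() for line in f if line.strip()]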

Hope this helps.

chufengt commented 2 years ago

Hi, @Boyiliee,

Thanks for your reply. It really helps. I have some extra questions:

  1. For the training time mentioned above, I noticed you said '1-2 days for ade20k' in #7. Was that measured with vit_b32 or vit_l16? I'm not sure whether the ~90h training time for vit_l16 on ade20k is reasonable. The config is the same as train.sh.
  2. Does this code support multi-node (e.g., 8*2 GPUs) training? (A generic sketch of a multi-node setup follows below.)
  3. When I tried to train LSeg on Cityscapes, I got a CUDA out-of-memory error with a crop size of 768 (line 31 in lseg_module.py), but 480 works. The backbone is vit_l16 and I'm using 8 32G V100s. Is that expected?
  4. For Cityscapes, I got mIoU ≈ 60% with the vit_l16 backbone (other configs the same as train.sh). That seems much lower than the SoTA results on semantic segmentation. Could you suggest how to improve the results?

Thanks again.
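
On question 2: the training code appears to be built on PyTorch Lightning, which itself supports multi-node training through its Trainer. A rough, untested sketch under that assumption; exact argument names depend on the Lightning version pinned by the repository:

import pytorch_lightning as pl

# Untested sketch: request 2 nodes x 8 GPUs via the Lightning Trainer.
# Argument names vary across Lightning releases; newer versions use
# strategy="ddp" instead of accelerator="ddp" and devices instead of gpus.
trainer = pl.Trainer(
    gpus=8,             # GPUs per node
    num_nodes=2,        # total number of nodes
    accelerator="ddp",  # distributed data parallel across all processes
    max_epochs=240,
)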

chufengt commented 2 years ago

Another quick question.

In test_lseg.py:

scales = (
    [0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.25]
    if "citys" in args.dataset
    else [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]
)

Could you give some references for these selected scales? I'm not very familiar with semantic segmentation, but I found that HRNet uses different scales: https://github.com/HRNet/HRNet-Semantic-Segmentation

Performance on the Cityscapes dataset. The models are trained and tested with the input size of 512x1024 and 1024x2048 respectively. If multi-scale testing is used, we adopt scales: 0.5,0.75,1.0,1.25,1.5,1.75.

Performance on the ADE20K dataset. The models are trained and tested with the input size of 520x520. If multi-scale testing is used, we adopt scales: 0.5,0.75,1.0,1.25,1.5,1.75,2.0 (the same as EncNet, DANet etc.).
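
For context, a scale list like this is typically consumed by standard multi-scale evaluation: run the model at each scale, resample the per-pixel logits back to the input resolution, and average them. A generic sketch with placeholder model and image, not the repository's actual evaluation loop:

import torch
import torch.nn.functional as F

# Generic multi-scale inference sketch; `model` maps an (N, 3, H, W) image
# batch to (N, num_classes, H', W') logits. Not the repo's exact eval code.
def multi_scale_predict(model, image, scales):
    _, _, h, w = image.shape
    fused = None
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        with torch.no_grad():
            logits = model(scaled)
        # resample predictions back to the original resolution
        logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                               align_corners=False)
        fused = logits if fused is None else fused + logits
    return fused / len(scales)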

Boyiliee commented 2 years ago

We don't conduct experiments on Cityscapes. For semantic segmentation, we strictly follow the settings of DPT: https://github.com/isl-org/DPT. Please see that repository for more details. Hope this helps!

chufengt commented 2 years ago

Hi, @Boyiliee,

Thanks for your reply.

It seems that DPT has released neither the training code nor the detailed settings for semantic segmentation.

  1. How about the training time for ade20k mentioned above? Is it reasonable?
  2. Which scale range did you use for the ade20k evaluation? Is it [0.5, 0.75, 1.0, 1.25, 1.5, 1.75]?

Thanks again.

TB5z035 commented 2 years ago

Hi!

Thanks for your work; it's really impressive. I would suggest documenting the 4th point above (adding label files) in the README, and also raising an error or warning when args.dataset is not ade20k, since the dataset choice is hardcoded in the LSegModule class (a sketch of such a guard is below). This could save a few hours for anyone hoping to use your codebase on other datasets.
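
As a concrete illustration of that guard, here is a hypothetical sketch; the registry name, function name, and placement inside LSegModule are assumptions, not existing code:

# Hypothetical guard (names assumed): fail fast on unsupported datasets
# instead of silently falling back to the hardcoded ade20k choice.
SUPPORTED_DATASETS = {"ade20k"}  # extend this after adding a label file

def validate_dataset(dataset):
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(
            "dataset '{}' is not configured; add a label file under "
            "label_files/ and register it in SUPPORTED_DATASETS".format(dataset)
        )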

Thanks again!