mit-han-lab / bevfusion

[ICRA'23] BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
https://bevfusion.mit.edu
Apache License 2.0
2.26k stars 409 forks

How to train on multi-gpu? #161

Closed hollow-503 closed 1 year ago

hollow-503 commented 1 year ago

Hello, when I run: torchpack dist-run -np 4 python tools/train.py configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth I get an error:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  /opt/conda/envs/xxx/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

I have two Xeon Gold 6148 CPUs and four V100 GPUs in a single machine, but it still says the slots are not enough. How should I run the training or test code on multiple GPUs on a single machine?
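For context, the "not enough slots" message comes from the MPI launcher that torchpack dist-run wraps, which counts launch slots independently of how many GPUs PyTorch can see, so it helps to confirm what each layer actually sees. As a hedged sketch (the helper below is hypothetical, not part of the repo), the CUDA_VISIBLE_DEVICES mask that launchers commonly export can be inspected like this:

```python
import os

def visible_gpu_count(env=None):
    """Hypothetical helper: count GPUs exposed via CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, i.e. the process is not
    restricted and can see every GPU on the machine.
    """
    env = os.environ if env is None else env
    mask = env.get("CUDA_VISIBLE_DEVICES")
    if mask is None:
        return None
    return len([d for d in mask.split(",") if d.strip()])

print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2,3"}))  # 4
```

If this reports all four GPUs but the launcher still refuses 4 slots, the limit is on the MPI side (its notion of available slots), not on the CUDA side.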

kentang-mit commented 1 year ago

Probably you can start by evaluating the pretrained models. If you cannot run multi-GPU inference, it will be hard to run multi-GPU training. By the way, if your custom setup does not work, we recommend trying out the docker setup first.

hollow-503 commented 1 year ago

I can only run with: torchpack dist-run -np 1 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml pretrained/bevfusion-det.pth --eval bbox. If I set -np 1, I get the result:

[Screenshot 2022-10-03 13:55:35: evaluation results]

If I set -np 2, it reports:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots
that were requested by the application:
  /opt/conda/envs/xxx/bin/python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

Does that mean I should evaluate or train with a more powerful CPU and GPU?

hollow-503 commented 1 year ago

> Probably you can start by evaluating the pretrained models. If you cannot run multi-GPU inference, it will be hard to run multi-GPU training. By the way, if your custom setup does not work, we recommend trying out the docker setup first.

I successfully ran with:

torchpack dist-run -np 1 python tools/train.py configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth

but only a single GPU is working. If I run with -np 2, it reports that there are not enough slots.

kentang-mit commented 1 year ago

Just to confirm, have you tried out our docker setup?

hollow-503 commented 1 year ago

> Just to confirm, have you tried out our docker setup?

No, I just tried the custom setup.

kentang-mit commented 1 year ago

My suggestion would be to try out the docker setup first. If the docker setup works, I would suggest comparing your system setup against our Dockerfile and fixing the differences.

hollow-503 commented 1 year ago

> My suggestion would be to try out the docker setup first. If the docker setup works, I would suggest comparing your system setup against our Dockerfile and fixing the differences.

Thanks for your reply, I trained it successfully. However, when I run the visualization command: torchpack dist-run -np 1 python tools/visualize.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --checkpoint pretrained/bevfusion-det.pth --out-dir viz/fusion-det-pred --mode pred --box-score 0.99, the result is:

[Screenshot 2022-10-14 12:12:50: visualization output]

The result seems to be wrong, even though I tried tuning box-score from 0.1 to 0.99. I also tried tuning nms-threshold, but in configs/nuscenes/det/transfusion/default.yaml the nms_type is null, so I cannot tune nms-threshold.

My evaluation result with the pretrained model is: [screenshot]
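For reference, the --box-score flag is expected to act as a plain confidence filter over the predicted boxes before drawing. A minimal sketch of that filtering (the function below is hypothetical, not the repo's code):

```python
def filter_by_score(boxes, scores, threshold):
    """Keep only detections whose confidence meets the threshold."""
    kept = [(b, s) for b, s in zip(boxes, scores) if s >= threshold]
    return [b for b, _ in kept], [s for _, s in kept]

boxes = ["car", "truck", "cone"]
scores = [0.95, 0.40, 0.02]
print(filter_by_score(boxes, scores, 0.5))  # (['car'], [0.95])
```

With a filter like this, a very high threshold such as 0.99 can legitimately leave almost nothing to draw, so the threshold alone does not explain boxes that are drawn in the wrong places.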

kentang-mit commented 1 year ago

@hollow-503,

Would you mind also trying out visualizing the predictions on the camera images? I saw duplicate bounding boxes before, but it seems that your visualizations are wildly off.

Best, Haotian

hollow-503 commented 1 year ago

> @hollow-503,
>
> Would you mind also trying out visualizing the predictions on the camera images? I saw duplicate bounding boxes before, but it seems that your visualizations are wildly off.
>
> Best, Haotian

Thanks for your quick reply. The predictions on the camera images are:

[Screenshot 2022-10-14 13:25:43: camera-view predictions]

with the command: torchpack dist-run -np 1 python tools/visualize.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --checkpoint pretrained/bevfusion-det.pth --out-dir viz/fusion-det-pred --mode pred --box-score 0.99

hollow-503 commented 1 year ago

> @hollow-503,
>
> Would you mind also trying out visualizing the predictions on the camera images? I saw duplicate bounding boxes before, but it seems that your visualizations are wildly off.
>
> Best, Haotian

I solved it by directly setting bbox-score in visualize.py:

parser.add_argument("--bbox-score", type=float, default=0.04)

and it works now.

However, I am curious: in the camera+lidar configuration, it seems you set bbox-score=0:

[Screenshots 2022-10-14 22:40:02 and 22:40:21: config settings]

but the model still gets great results. Is there something I am missing?

kentang-mit commented 1 year ago

I think for our demos we did not use bbox_score=0. For camera-only we probably did NMS (because it is simpler; you don't have to tune the parameters).
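For reference, the NMS mentioned here greedily keeps the highest-scoring box and discards any remaining box that overlaps it too strongly. A generic 2D sketch (not the repo's implementation, which works on 3D boxes):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping boxes plus one far away: the lower-scoring
# duplicate is suppressed, the distant box survives.
print(nms([(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)],
          [0.9, 0.8, 0.7]))  # [0, 2]
```

This is why NMS removes duplicate boxes without needing a score threshold: suppression is driven by overlap, not by confidence.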

hollow-503 commented 1 year ago

Thanks for your reply! Actually, when I reproduce the fusion model training with the command: torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/convfuser.yaml --load-from pretrained/lidar-only-det.pth following #189, an error occurs:

Traceback (most recent call last):
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/pytree_utils.py", line 119, in ParseCodeToTree
    tree = parser_driver.parse_string(code, debug=False)
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/driver.py", line 103, in parse_string
    return self.parse_tokens(tokens, debug)
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/driver.py", line 71, in parse_tokens
    if p.addtoken(type, value, (prefix, start)):
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/parse.py", line 162, in addtoken
    raise ParseError("bad input", type, value, context)
lib2to3.pgen2.parse.ParseError: bad input: type=3, value="'deterministic'", context=('\n', (2, 0))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/pytree_utils.py", line 125, in ParseCodeToTree
    tree = parser_driver.parse_string(code, debug=False)
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/driver.py", line 103, in parse_string
    return self.parse_tokens(tokens, debug)
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/driver.py", line 71, in parse_tokens
    if p.addtoken(type, value, (prefix, start)):
  File "/home/xxx/bevfusion/lib/python3.8/lib2to3/pgen2/parse.py", line 162, in addtoken
    raise ParseError("bad input", type, value, context)
lib2to3.pgen2.parse.ParseError: bad input: type=3, value="'deterministic'", context=('\n', (2, 0))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 183, in FormatCode
    tree = pytree_utils.ParseCodeToTree(unformatted_source)
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/pytree_utils.py", line 131, in ParseCodeToTree
    raise e
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/pytree_utils.py", line 129, in ParseCodeToTree
    ast.parse(code)
  File "/home/xxx/bevfusion/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 2
    'deterministic': False
    ^
SyntaxError: invalid syntax

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 87, in <module>
    main()
  File "tools/train.py", line 51, in main
    logger.info(f"Config:\n{cfg.pretty_text}")
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/mmcv/utils/config.py", line 496, in pretty_text
    text, _ = FormatCode(text, style_config=yapf_style, verify=True)
  File "/home/xxx/bevfusion/lib/python3.8/site-packages/yapf/yapflib/yapf_api.py", line 186, in FormatCode
    raise errors.YapfError(errors.FormatErrorMsg(e))
yapf.yapflib.errors.YapfError: <unknown>:2:1: invalid syntax

Also, in the function init_weights in bevfusion.py, it seems that no lidar pretrained weights are loaded during training initialization:

    def init_weights(self) -> None:
        if "camera" in self.encoders:
            self.encoders["camera"]["backbone"].init_weights()

Could you tell me the command to train the fusion model?
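On the YapfError in the traceback above: mmcv's cfg.pretty_text pretty-prints the config and asks yapf to format it, and yapf ultimately falls back to ast.parse; the error means the emitted text is not valid Python at that line (a bare 'deterministic': False item outside dict braces). A commonly reported cause is a yapf/mmcv version mismatch, so pinning a compatible yapf is worth trying, though the exact versions depend on the environment. The failure mode itself can be reproduced with the stdlib alone:

```python
import ast

def parses_as_python(code: str) -> bool:
    """The same validity check yapf falls back to: does ast.parse accept it?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

# A bare dict item, as seen in the traceback, is not valid Python on its own...
print(parses_as_python("'deterministic': False"))    # False
# ...but the same item wrapped in braces (how the config should print) is:
print(parses_as_python("{'deterministic': False}"))  # True
```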

kentang-mit commented 1 year ago

You may try "load_from" @hollow-503. Detailed experiment configurations will be provided in the future. Please stay tuned.
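On the init_weights question: --load-from initializes the whole model from a checkpoint rather than going through init_weights, which only touches the camera backbone. A minimal pure-Python sketch of that non-strict loading semantics (a stand-in for illustration, not mmcv's actual loader):

```python
def nonstrict_load(model_state, ckpt_state):
    """Copy checkpoint tensors into the model state wherever keys match,
    reporting keys that exist only on one side (what strict=False tolerates)."""
    merged = dict(model_state)
    for key, value in ckpt_state.items():
        if key in merged:
            merged[key] = value
    missing = [k for k in model_state if k not in ckpt_state]     # stay at init
    unexpected = [k for k in ckpt_state if k not in model_state]  # ignored
    return merged, missing, unexpected
```

Under this scheme, loading a lidar-only checkpoint into the fusion model fills the lidar branch and leaves the camera branch at its init_weights initialization, which matches the intent of the command above.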

kentang-mit commented 1 year ago

Closed due to inactivity.

YoushaaMurhij commented 1 year ago

> > Probably you can start by evaluating the pretrained models. If you cannot run multi-GPU inference, it will be hard to run multi-GPU training. By the way, if your custom setup does not work, we recommend trying out the docker setup first.
>
> I successfully ran with:
>
> torchpack dist-run -np 1 python tools/train.py configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth
>
> but only a single GPU is working. If I run with -np 2, it reports that there are not enough slots.

@hollow-503, Could you please tell me how you solved this problem: There are not enough slots available in the system to satisfy the 4 slots? Thanks

hollow-503 commented 1 year ago

> > Probably you can start by evaluating the pretrained models. If you cannot run multi-GPU inference, it will be hard to run multi-GPU training. By the way, if your custom setup does not work, we recommend trying out the docker setup first.
>
> I successfully ran with torchpack dist-run -np 1 python tools/train.py configs/nuscenes/det/centerhead/lssfpn/camera/256x704/swint/default.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth but only a single GPU is working. If I run with -np 2, it reports that there are not enough slots.

> @hollow-503, Could you please tell me how you solved this problem: There are not enough slots available in the system to satisfy the 4 slots? Thanks

Actually, I used another server with 8 GPUs and it worked. I think it has something to do with your hardware.