I read that we can run this from the command line. But how do we achieve parallelism so that we can just make the API call and it runs across however many GPUs are attached, e.g. 2 or 4?
To enable sequence parallelism, you need to launch the inference script with torchrun instead of invoking it directly; torchrun spawns one worker process per GPU.
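A minimal sketch of such a launch command, assuming 2 GPUs on a single node (the script name `inference.py` and its arguments are placeholders; substitute your actual entry point and flags):

```shell
# Launch one process per GPU on this machine; torchrun sets RANK,
# LOCAL_RANK, and WORLD_SIZE for each worker so the script can shard work.
torchrun --nproc_per_node=2 inference.py
```

For 4 GPUs, change `--nproc_per_node=2` to `--nproc_per_node=4`. Note that this only distributes the script launched by torchrun; if you are serving requests through an API process started some other way, that server itself must be launched under torchrun (or otherwise initialize a process group) for the extra GPUs to be used.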
I'm making the API call and it works great, but watching utilization, it's only using one of the GPUs despite having 2 attached.