m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

How to use multi GPU #391

Open dh-Kang opened 1 year ago

dh-Kang commented 1 year ago

Hi. Sometimes I need to transcribe recorded files that are tens of seconds to a few minutes long. I used whisperX, but transcription alone takes a long time (I don't use alignment or diarization): normally about half a day. I have 8 GPUs, but whisperX uses only one of them while it is running. I found that the v3 branch supports multiple GPUs and that it has been merged into the main branch. How can I use all of my GPUs?

whisperx.load_model('large-v2', 'cuda', compute_type='float16', device_index=list(range(8)), language="ko")

I load the model with the code above, and I tried batch sizes of 16 and 64. Either way, whisperX only ever uses one of the 8 GPUs.

chenrq2005 commented 1 year ago

It seems the load_model function can only handle a single int as device_index, whereas the FasterWhisper object accepts a list of ints when the model is constructed.

chenrq2005 commented 1 year ago

In transcribe.py I modified these lines. It seems to have worked; I'm doing more verification.

26: parser.add_argument("--device_index", default=0, type=Union[int, List[int]], help="device index to use for FasterWhisper inference")
86: device_index: Union[int, List[int]] = args.pop("device_index")
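A possible alternative for the command-line side, sketched under assumptions: argparse calls whatever is passed as `type` on the raw argument string, so a typing annotation such as Union[int, List[int]] is unlikely to convert a value like 0,1 by itself. The helper `parse_device_index` below is hypothetical (it is not part of whisperX) and accepts either a single index or a comma-separated list.

```python
import argparse
from typing import List, Union


def parse_device_index(value: str) -> Union[int, List[int]]:
    """Convert '0' to 0 and '0,1,2' to [0, 1, 2] (hypothetical helper)."""
    parts = [int(p) for p in value.split(",") if p.strip()]
    if not parts:
        raise argparse.ArgumentTypeError(f"no device index found in {value!r}")
    # A single value stays an int, matching the original default of 0.
    return parts[0] if len(parts) == 1 else parts


parser = argparse.ArgumentParser()
parser.add_argument(
    "--device_index",
    default=0,
    type=parse_device_index,
    help="device index (or comma-separated indices) to use for inference",
)

# Example invocation with an explicit argument list instead of sys.argv.
args = parser.parse_args(["--device_index", "0,1,2"])
print(args.device_index)
```

Whether the parsed list then reaches the underlying model depends on how load_model forwards device_index, which this sketch does not cover.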

kyleboddy commented 1 year ago

As far as I can tell, using multiple GPUs affords zero speedup. The code appears to just access the GPUs round-robin after loading the model onto each of them.

This is before attempting to merge @chenrq2005's code. I'm not sure why that change would work when the initial code does not: whisperx/faster_whisper definitely uses multiple GPUs, but does not appear to use them in any way that helps performance.

EDIT: I suppose that could be loading the encoder and decoder to each GPU in the graph below, but I have my doubts.

kyleboddy commented 1 year ago

[GIF: GPU-toggle]

Watch this GIF of nvtop being used. The GPUs are just used in round robin fashion, not concurrently. (Only my P40s are visible to the machine)

chenrq2005 commented 1 year ago

Thanks @kyleboddy, based on my investigation I do not think the model will process a single audio file with multiple GPUs concurrently.

kyleboddy commented 1 year ago

It could do this via chunking and splitting, then combining; there's logic out there that does it. But I don't believe this library handles it natively. Could make for a good pull request in the future.

kyleboddy commented 1 year ago

After running it with a single GPU and comparing: the multi-GPU method using DEVICE_INDEX_LIST or similar definitely uses the extra GPUs, but it only reduces memory pressure per GPU (by a good amount, though: nearly half). It does not afford any computational speedup in its native setting.

Ran a quick test. Single GPU, Tesla P40 on a ~45 minute discussion along with some other stuff (not entirely whisper):

real    8m37.289s 
user    7m35.752s
sys     0m26.459s

Multi-GPU, 2x Tesla P40s, DEVICE_INDEX_LIST = [0, 1]:

real    8m15.656s
user    7m31.176s
sys     0m26.772s

No actual improvement. Memory pressure as noted was lower per GPU, but that's it.

Both using fp32, batch size 16.
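For reference, the difference between the two `real` figures can be worked out with a few lines of Python. This is only arithmetic on the numbers quoted above; the helper name `to_seconds` is made up for the example.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style value such as '8m37.289s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


single_gpu = to_seconds("8m37.289s")  # 1x Tesla P40
multi_gpu = to_seconds("8m15.656s")   # 2x Tesla P40
saved = single_gpu - multi_gpu

print(f"{saved:.1f} s saved ({100 * saved / single_gpu:.1f}% less wall-clock time)")
# prints: 21.6 s saved (4.2% less wall-clock time)
```

A gap of about 4% on a single run that includes non-whisper work is small next to the roughly 2x that ideal scaling over two GPUs would give.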

m-bain commented 1 year ago

Thanks for the investigation, Kyle. Yes, it should be possible to shard the audio segments over multiple GPUs and transcribe/align in parallel; I will note this as future work.
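A rough sketch of what such sharding could look like, assuming the audio has already been cut into segments. This is not whisperX code: `shard`, `transcribe_sharded`, and the `transcribe_fn(device_index, segment)` callable are hypothetical names, and the stand-in function at the bottom only formats a string so the example runs without a GPU.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def shard(segments: Sequence[Segment], num_devices: int) -> List[List[Tuple[int, Segment]]]:
    """Assign segments to devices round-robin, remembering each original position."""
    shards: List[List[Tuple[int, Segment]]] = [[] for _ in range(num_devices)]
    for position, segment in enumerate(segments):
        shards[position % num_devices].append((position, segment))
    return shards


def transcribe_sharded(
    segments: Sequence[Segment],
    num_devices: int,
    transcribe_fn: Callable[[int, Segment], str],
) -> List[str]:
    """Run one worker thread per device, then restore the original segment order."""

    def run(device_index: int, items: List[Tuple[int, Segment]]) -> List[Tuple[int, str]]:
        return [(pos, transcribe_fn(device_index, seg)) for pos, seg in items]

    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        futures = [
            pool.submit(run, device, items)
            for device, items in enumerate(shard(segments, num_devices))
        ]
        indexed = [pair for future in futures for pair in future.result()]
    return [text for _, text in sorted(indexed, key=lambda pair: pair[0])]


# Stand-in for a real per-device transcription call (assumption for the demo).
def fake_transcribe(device_index: int, segment: Segment) -> str:
    return f"gpu{device_index}:{segment[0]:.0f}-{segment[1]:.0f}"


texts = transcribe_sharded([(0, 30), (30, 60), (60, 90)], 2, fake_transcribe)
print(texts)
```

In a real setup each device would need its own model instance, and the merge step would also have to handle timestamps and text at segment boundaries.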

supratim1121992 commented 1 year ago

I am using the WhisperX model on an EC2 instance with 8 V100 GPUs (16 GB each) and trying to process multiple files asynchronously using ThreadPoolExecutor. The process runs fine, but over time (after successfully processing over 500-1000 files) it fails sporadically with one of two error messages:

  1. CUDA error: misaligned address CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
  2. CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have already tried looking into these errors and performed the following:

This is how I am loading the WhisperX model and using it:

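import concurrent.futures
import datetime
import logging
import re
import tempfile

import whisperx

# Assumed to be defined elsewhere in the original script: s3_con (S3 client),
# s3_in (bucket name), fl_format (audio file extension), num_gpu (worker count).
input_asr_options = {"beam_size": 3,"best_of": 3,"patience": 1,"length_penalty": 1,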
                      "temperatures": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],"compression_ratio_threshold": 2.4,
                      "log_prob_threshold": -1.0,"no_speech_threshold": 0.6,"condition_on_previous_text": False,
                      "initial_prompt": None,"prefix": None,"suppress_blank": True,"suppress_tokens": [-1],
                      "without_timestamps": True,"max_initial_timestamp": 0.0,"word_timestamps": False,
                      "prepend_punctuations": "\"'“¿([{-","append_punctuations": "\"'.。,,!!??::”)]}、",
                      "suppress_numerals": False}
mod_trans = whisperx.load_model("medium.en", device = "cuda", device_index = [0,1,2,3,4,5,6,7], compute_type = "float16",
                                language = "en", asr_options = input_asr_options)
#mod_trans.transcribe = torch.compile(mod_trans.module.transcribe, mode = "max-autotune", backend = "inductor")
#mod_trans = torch.nn.DataParallel(mod_trans)

def download_transcribe(s3_key):
    """Download one file from S3 and return its transcript, or "Error" on failure."""
    proc_time = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M_%S")
    try:
        with tempfile.NamedTemporaryFile(suffix="." + fl_format, delete=True) as temp_in:
            # Download the file from the S3 bucket
            s3_con.download_file(s3_in, s3_key, temp_in.name)
            # Load the audio and transcribe it
            whisper_aud = whisperx.load_audio(temp_in.name)
            whisper_results = mod_trans.transcribe(whisper_aud, batch_size=32, num_workers=1, language="en")
            # Join the segment texts and collapse runs of whitespace
            text = re.sub(r'\s+', ' ', "".join(seg["text"] for seg in whisper_results["segments"])).strip()
        return text

    except Exception as e:
        logging.error("Error encountered in file download and transcription: %s", e)
        return "Error"

# Run the function above across multiple worker threads (up to 8)
def transcribe_multi(s3_keys):
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_gpu) as executor:
        futures = [executor.submit(download_transcribe, key) for key in s3_keys]
        # as_completed yields in completion order, so results do not follow the order of s3_keys
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
    logging.info("Call transcription completed")
    return results

Any help or suggestions would be appreciated as I can't really get to the root cause of the error.

kyleboddy commented 1 year ago

I've been using ThreadPoolExecutor() as well to run on 3x Tesla P40 GPUs and still run into CUDA error: an illegal memory access was encountered every so often. It almost always occurs when I accidentally load a model to a GPU that has concurrent processing on it. I've added multiple GPU memory checks using GPUtil and other methods to ensure no overlapping, and that seems to solve the problem to like 99.999% so far.

My only suggestion is to add graceful error handling that restarts the application state, keep track of the files using a locking semaphore or similar, and reprocess them after a failure. I never got to the bottom of the actual issue.
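One way the "track and reprocess" idea could be sketched, assuming failures surface as RuntimeError (which is how PyTorch usually reports CUDA errors). The function names are hypothetical, and the flaky stand-in only simulates a single failure so the example runs anywhere.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_with_retries(
    items: Iterable[str],
    process_fn: Callable[[str], str],
    max_attempts: int = 3,
) -> Tuple[Dict[str, str], List[str]]:
    """Run process_fn on every item, re-queuing failed items for later passes."""
    results: Dict[str, str] = {}
    pending = list(items)
    for _ in range(max_attempts):
        failed: List[str] = []
        for item in pending:
            try:
                results[item] = process_fn(item)
            except RuntimeError:
                # Remember the file so it is retried on the next pass.
                failed.append(item)
        pending = failed
        if not pending:
            break
    return results, pending


# Stand-in for a transcription call that fails once on one file (assumption).
already_failed = set()


def flaky_transcribe(path: str) -> str:
    if path == "b.wav" and path not in already_failed:
        already_failed.add(path)
        raise RuntimeError("CUDA error: an illegal memory access was encountered")
    return f"text of {path}"


done, still_failing = process_with_retries(["a.wav", "b.wav"], flaky_transcribe)
print(done, still_failing)
```

In practice a CUDA error may leave the process in an unusable state, so a real version would probably reload the model or restart the worker between passes rather than simply calling the function again.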

EDIT: It sounds like you are using one file/model per GPU, correct? (This is how I handle it)

If not and you are instead trying to use multithreading using the native implementation, that is definitely going to cause OOM and other collision errors even though it possibly should not. I use a semaphore and load the model to each GPU, detecting which one is free / has lowest memory usage, and do the encoding/decoding for a given file entirely on one GPU, then go to the next one for the next file, etc.
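A minimal sketch of the one-file-per-GPU pattern, under assumptions: instead of checking memory usage with GPUtil, it keeps a queue of free device indices, so a GPU is handed to at most one job at a time. `GpuPool` and `fake_transcribe` are hypothetical names; a real job would call a model that was loaded on the given device.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class GpuPool:
    """Hands out device indices so that each GPU runs one job at a time."""

    def __init__(self, device_indices: Iterable[int]) -> None:
        self._free: "queue.Queue[int]" = queue.Queue()
        for index in device_indices:
            self._free.put(index)

    def run(self, job: Callable[[int, T], R], item: T) -> R:
        device_index = self._free.get()  # blocks until some GPU is free
        try:
            return job(device_index, item)
        finally:
            self._free.put(device_index)  # always hand the GPU back


# Stand-in for "transcribe this file entirely on this GPU" (assumption).
def fake_transcribe(device_index: int, path: str) -> str:
    return path.upper()


pool = GpuPool([0, 1, 2])
files = ["a.wav", "b.wav", "c.wav", "d.wav"]
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(lambda path: pool.run(fake_transcribe, path), files))
print(results)
```

The queue plays the role of the semaphore: a fourth file waits until one of the three devices is returned.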

supratim1121992 commented 1 year ago

I'm loading the model on all 8 GPUs using device_index, as can be seen in the code I shared. I'm not assigning files to any specific GPU for processing. The threads are handled by whichever GPUs are not in use at the moment, as I've set the number of threads to be less than the number of GPUs available (8 in my case). Each thread processes one file at a time.

After quite a bit of tinkering and experimentation, I was finally able to resolve this error by upgrading PyTorch to the latest nightly release with CUDA 12.1 and cuDNN to 8.9.4, after referring to the support matrix available on Nvidia's webpage. I have been running the process over the past few days and have already processed over 50k audio files successfully without the error recurring even once. Hope that helps in resolving the issue in your case as well.

DigilConfianz commented 1 year ago

@kyleboddy I think ThreadPoolExecutor fails because the VAD model is loaded on the wrong device (just cuda, with no device number specified). Rewriting these lines:

https://github.com/m-bain/whisperX/blob/e94b9043085c32c365b2b60f23e73b2d03c2241c/whisperx/asr.py#L25C11-L25C11

https://github.com/m-bain/whisperX/blob/e94b9043085c32c365b2b60f23e73b2d03c2241c/whisperx/asr.py#L105

to vad_model = load_vad_model(torch.device(f"{device}:{device_index}"), use_auth_token=None, **default_vad_options), and passing the device as "cuda" and device_index as the device number, helped me solve the issue.

jamesmcarthur115555 commented 1 year ago

Can you put the code into a Docker container, then launch 8 containers, passing a different GPU device index to each container? The code inside each container only sees one device. This would possibly guarantee that each device is used. If you are using less memory on a particular device, you could maybe use MIG, if available, to further partition the GPUs.
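The same isolation idea can be sketched without containers, as an assumption-laden analogue rather than the Docker setup itself: start one worker process per GPU and set the CUDA_VISIBLE_DEVICES environment variable so that each process sees a single device. `launch_worker` is a hypothetical helper, and the worker here only prints the variable so the example runs on a machine without GPUs.

```python
import os
import subprocess
import sys
from typing import List


def launch_worker(gpu_index: int, python_args: List[str]) -> subprocess.Popen:
    """Start a Python worker that can only see the given GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        [sys.executable, *python_args],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
    )


# Stand-in worker: a real one would load the model and transcribe its share of files.
worker_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"

procs = [launch_worker(gpu, ["-c", worker_code]) for gpu in range(2)]
outputs = [proc.communicate()[0].strip() for proc in procs]
print(outputs)
```

Inside each worker the visible GPU is addressed as device 0, so the worker code does not need to know which physical card it was given.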

lazyseacow commented 10 months ago

@kyleboddy Thank you for your answer that helped me to solve most of the doubts. Have you ever tried to create multiple models with different GPU device numbers for each model and then use multithreading for concurrency? I'm not sure if this is feasible.

kyleboddy commented 10 months ago

I have not. I don't require that speed up as I'm running on Tesla P40s as-is; I'd switch to newer hardware first before trying to optimize multithreading. Sorry! IMO it is not possible to get a speed up using multiple GPUs.

ewwink commented 6 months ago

Using device_index=[0,1] does use multiple GPUs, but it is not well optimized. For optimized multi-GPU/TPU support, see Whisper-Jax; the downside is that the model is twice as large as faster-whisper, so it may be up to 100% slower.