Open dh-Kang opened 1 year ago
It seems the load_model function can only handle a single int as device_index, whereas the FasterWhisper object accepts a list of ints when the FasterWhisper model is defined.
In transcribe.py I modified these lines and it seems to work; I'm still verifying:
26: parser.add_argument("--device_index", default=0, type=Union[int, List[int]], help="device index to use for FasterWhisper inference")
86: device_index: Union[int, List[int]] = args.pop("device_index")
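One caveat on the change to line 26: as far as I can tell, argparse calls `type` on the raw command-line string, and `Union[int, List[int]]` is a typing construct rather than a converter, so the modified line probably only works while the flag is left at its default. Below is a minimal sketch of an alternative, assuming it is acceptable to take one or more integers via `nargs="+"`. The `normalize_device_index` helper is hypothetical and not part of whisperX.

```python
import argparse
from typing import List, Union


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # nargs="+" collects one or more ints:
    #   --device_index 0      -> [0]
    #   --device_index 0 1    -> [0, 1]
    parser.add_argument(
        "--device_index",
        default=[0],
        type=int,
        nargs="+",
        help="device index (or indices) to use for FasterWhisper inference",
    )
    return parser


def normalize_device_index(values: List[int]) -> Union[int, List[int]]:
    # Hypothetical helper: collapse a single-element list to a plain int
    # so single-GPU callers keep receiving an int.
    return values[0] if len(values) == 1 else values


args = build_parser().parse_args(["--device_index", "0", "1"])
device_index = normalize_device_index(args.device_index)
```

With this shape the annotation `Union[int, List[int]]` on line 86 stays accurate, since the value is either an int or a list of ints.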
As far as I can tell, using multi-GPU affords zero speedup. The code appears to just round-robin access the GPUs after loading the model into each of them.
This is before attempting to merge @chenrq2005's code. I'm not sure why that would work when the initial code does not: whisperx/faster_whisper will definitely use multiple GPUs, but does not appear to use them in any way that helps performance.
EDIT: I suppose that could be loading the encoder and decoder to each GPU in the graph below, but I have my doubts.
Watch this GIF of nvtop being used. The GPUs are just used in round-robin fashion, not concurrently. (Only my P40s are visible to the machine.)
Thanks @kyleboddy. Based on my investigation, I do not think the model will process a single audio file on multiple GPUs concurrently.
It could do this via chunking and splitting, then combining; there's logic out there that does it. But I don't believe this library handles it natively. Could make for a good pull request in the future.
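A rough sketch of the chunk/split/combine idea, under stated assumptions: the audio has already been cut into segments, and `transcribe_fn(device, segment)` is a hypothetical stand-in for whatever runs one segment on one GPU. Nothing here is whisperX's actual API; it only illustrates sharding segments round-robin across devices and merging the results back in order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def shard_round_robin(
    segments: List[Segment], num_devices: int
) -> Dict[int, List[Tuple[int, Segment]]]:
    # Remember each segment's original position so output can be re-ordered.
    shards: Dict[int, List[Tuple[int, Segment]]] = {d: [] for d in range(num_devices)}
    for position, seg in enumerate(segments):
        shards[position % num_devices].append((position, seg))
    return shards


def transcribe_sharded(
    segments: List[Segment],
    num_devices: int,
    transcribe_fn: Callable[[int, Segment], str],
) -> List[str]:
    shards = shard_round_robin(segments, num_devices)

    def run_shard(device: int) -> List[Tuple[int, str]]:
        # One worker thread per device; each handles only its own shard.
        return [(pos, transcribe_fn(device, seg)) for pos, seg in shards[device]]

    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        per_device = list(pool.map(run_shard, range(num_devices)))

    # Combine: flatten and sort by original position.
    merged = sorted((item for shard in per_device for item in shard), key=lambda x: x[0])
    return [text for _, text in merged]


def fake_transcribe(device: int, seg: Segment) -> str:
    # Placeholder for a real per-device transcription call.
    return f"[{seg[0]:.0f}-{seg[1]:.0f}s on gpu{device}]"


result = transcribe_sharded([(0, 30), (30, 60), (60, 90)], 2, fake_transcribe)
```

A real version would also have to handle segment boundaries (words cut in half) and timestamp offsets, which this sketch ignores.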
After running it with a single GPU and comparing, the multi-GPU method (using DEVICE_INDEX_LIST or similar) definitely does use all the GPUs, but it only reduces memory pressure per GPU (by a good amount though, nearly half). It does not afford any computational speedup in its native setting.
Ran a quick test. Single GPU, Tesla P40 on a ~45 minute discussion along with some other stuff (not entirely whisper):
real 8m37.289s
user 7m35.752s
sys 0m26.459s
Multi-GPU, 2x Tesla P40s, DEVICE_INDEX_LIST = [0, 1]:
real 8m15.656s
user 7m31.176s
sys 0m26.772s
No actual improvement. Memory pressure as noted was lower per GPU, but that's it.
Both using fp32, batch size 16.
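To put a number on the comparison, the two `real` times above can be converted to seconds and divided. A small sketch; only the timings come from the post, and the helper name is made up:

```python
def to_seconds(minutes: int, seconds: float) -> float:
    # Convert the "XmY.YYYs" format printed by `time` into plain seconds.
    return minutes * 60 + seconds


single_gpu = to_seconds(8, 37.289)  # single Tesla P40
multi_gpu = to_seconds(8, 15.656)   # 2x Tesla P40

speedup = single_gpu / multi_gpu
saved_pct = 100 * (single_gpu - multi_gpu) / single_gpu

print(f"speedup: {speedup:.3f}x, wall time saved: {saved_pct:.1f}%")
```

That works out to roughly a 4% difference in wall time, which could easily be run-to-run noise given that the job was not entirely Whisper.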
Thanks for the investigation, kyle. Yes, it should be possible to shard the audio segments over multiple GPUs and transcribe/align in parallel; will note this as future work.
I am using the WhisperX model on an EC2 instance with 8 V100 GPUs (16 GB each) and trying to process multiple files asynchronously using the ThreadPoolExecutor. The process runs fine but over time (after successfully processing over 500-1000 files) results in an error erratically with two different error messages:
[...] TORCH_USE_CUDA_DSA to enable device-side assertions.
[...] TORCH_USE_CUDA_DSA to enable device-side assertions.
I have already tried looking into these errors and performed the following:
nvidia-smi output:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:17.0 Off |                    0 |
| N/A   29C    P0              57W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off | 00000000:00:18.0 Off |                    0 |
| N/A   30C    P0              56W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           Off | 00000000:00:19.0 Off |                    0 |
| N/A   29C    P0              58W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           Off | 00000000:00:1A.0 Off |                    0 |
| N/A   30C    P0              56W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-16GB           Off | 00000000:00:1B.0 Off |                    0 |
| N/A   29C    P0              56W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-16GB           Off | 00000000:00:1C.0 Off |                    0 |
| N/A   30C    P0              54W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-16GB           Off | 00000000:00:1D.0 Off |                    0 |
| N/A   31C    P0              55W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-16GB           Off | 00000000:00:1E.0 Off |                    0 |
| N/A   29C    P0              55W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Set the batch_size hyperparameter to 8, 16 & 32. The memory usage only goes up to 50% max, but the GPU compute utilization goes up to 100% on multiple GPUs (due to multiple files being processed via multithreading).
Tried torch.compile and torch.nn.DataParallel to parallelise the model, but this still resulted in the error after processing over 500 files successfully.
This is how I am loading the WhisperX model and using it:
import concurrent.futures
import datetime
import logging
import re
import tempfile

import whisperx

# Note: s3_con, s3_in, fl_format and num_gpu are defined elsewhere in the script (not shown here).
input_asr_options = {"beam_size": 3,"best_of": 3,"patience": 1,"length_penalty": 1,
    "temperatures": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],"compression_ratio_threshold": 2.4,
    "log_prob_threshold": -1.0,"no_speech_threshold": 0.6,"condition_on_previous_text": False,
    "initial_prompt": None,"prefix": None,"suppress_blank": True,"suppress_tokens": [-1],
    "without_timestamps": True,"max_initial_timestamp": 0.0,"word_timestamps": False,
    "prepend_punctuations": "\"'“¿([{-","append_punctuations": "\"'.。,,!!??::”)]}、",
    "suppress_numerals": False}

# Load the model once, replicated across all 8 GPUs
mod_trans = whisperx.load_model("medium.en", device = "cuda", device_index = [0,1,2,3,4,5,6,7], compute_type = "float16",
                                language = "en", asr_options = input_asr_options)
#mod_trans.transcribe = torch.compile(mod_trans.module.transcribe, mode = "max-autotune", backend = "inductor")
#mod_trans = torch.nn.DataParallel(mod_trans)

# Function to download one file from S3 and transcribe it
def download_transcribe(s3_key):
    proc_time = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M_%S")
    try:
        with tempfile.NamedTemporaryFile(suffix = "." + fl_format, delete = True) as temp_in:
            # Downloading the file from S3 bucket
            s3_con.download_file(s3_in, s3_key, temp_in.name)
            # Loading the file and transcribing
            whisper_aud = whisperx.load_audio(temp_in.name)
            whisper_results = mod_trans.transcribe(whisper_aud, batch_size=32, num_workers = 1, language = "en")
            # Join the segment texts and collapse repeated whitespace
            text = re.sub(r'\s+', ' ', "".join(seg["text"] for seg in whisper_results["segments"])).strip()
            return text
    except Exception as e:
        logging.error("Error encountered in file download and transcription: %s", e)
        return "Error"

# Function to perform multithreading using the above function with multiple workers (up to 8)
def transcribe_multi(s3_keys):
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_gpu) as executor:
        futures = [executor.submit(download_transcribe, key) for key in s3_keys]
        # Results arrive in completion order, not submission order
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
    logging.info("Call transcription completed")
    return results
Any help or suggestions would be appreciated as I can't really get to the root cause of the error.
I've been using ThreadPoolExecutor() as well to run on 3x Tesla P40 GPUs and still run into "CUDA error: an illegal memory access was encountered" every so often. It almost always occurs when I accidentally load a model to a GPU that has concurrent processing on it. I've added multiple GPU memory checks using GPUtil and other methods to ensure no overlapping, and that seems to solve the problem about 99.999% of the time so far.
My only suggestion is to add graceful error handling that restarts the state of the application, keep track of the files using a locking semaphore or similar, and reprocess them after a failure. I never got to the bottom of the actual issue.
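A minimal sketch of that "track, reset, reprocess" pattern, kept sequential for clarity (so no semaphore is needed here). `worker` and `reset_state` are hypothetical callables: the first would transcribe one file, the second would do whatever state restart the application needs, such as reloading the model.

```python
import logging
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def process_with_retries(
    keys: Iterable[str],
    worker: Callable[[str], str],
    max_attempts: int = 3,
    reset_state: Optional[Callable[[], None]] = None,
) -> Tuple[Dict[str, str], List[str]]:
    results: Dict[str, str] = {}
    pending = list(keys)
    for attempt in range(1, max_attempts + 1):
        if not pending:
            break
        # Before every retry pass, give the application a chance to reset.
        if attempt > 1 and reset_state is not None:
            reset_state()
        failed: List[str] = []
        for key in pending:
            try:
                results[key] = worker(key)
            except Exception as exc:
                logging.warning("attempt %d failed for %s: %s", attempt, key, exc)
                failed.append(key)
        pending = failed
    # Anything still pending exhausted its attempts.
    return results, pending
```

One design note: returning the failed keys instead of a sentinel string like "Error" makes it easier to tell a failed file from a genuine transcript.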
EDIT: It sounds like you are using one file/model per GPU, correct? (This is how I handle it)
If not and you are instead trying to use multithreading using the native implementation, that is definitely going to cause OOM and other collision errors even though it possibly should not. I use a semaphore and load the model to each GPU, detecting which one is free / has lowest memory usage, and do the encoding/decoding for a given file entirely on one GPU, then go to the next one for the next file, etc.
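Roughly, the one-file-per-GPU scheduling could look like the sketch below. It is an assumption-laden simplification: instead of querying free memory with GPUtil, it hands out device indices from a blocking queue, which guarantees at most one job per GPU at a time. `transcribe_on_device` is a hypothetical function that would run a file on the model already loaded on that GPU.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager
from typing import Callable, Iterator, List


class DevicePool:
    """Hands out GPU indices so that each GPU serves one job at a time."""

    def __init__(self, device_indices: List[int]) -> None:
        self._free: "queue.Queue[int]" = queue.Queue()
        for idx in device_indices:
            self._free.put(idx)

    @contextmanager
    def acquire(self) -> Iterator[int]:
        idx = self._free.get()  # blocks until some GPU is free
        try:
            yield idx
        finally:
            self._free.put(idx)  # always return the GPU, even on error


def run_all(
    files: List[str],
    pool: DevicePool,
    transcribe_on_device: Callable[[int, str], str],
    max_workers: int,
) -> List[str]:
    def job(path: str) -> str:
        with pool.acquire() as device_index:
            # The whole file is processed on a single GPU.
            return transcribe_on_device(device_index, path)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map keeps results in the same order as `files`.
        return list(executor.map(job, files))
```

Even if `max_workers` exceeds the number of GPUs, the extra threads simply wait on the queue rather than colliding on a busy device.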
I'm loading the model on all 8 GPUs using device_index, as can be seen in the code I shared. I'm not loading the files on any specific GPU for processing. The threads are picked up by whichever GPUs are not in use at the moment, as I've set the number of threads to be less than the number of GPUs available (8 in my case). Each thread processes one file at a time.
After quite a bit of tinkering and experimentation, I was finally able to resolve this error by upgrading my PyTorch version to the latest nightly release with CUDA 12.1 and the cuDNN version to 8.9.4, after referring to the support matrix available on Nvidia's webpage. I have been running the process over the past few days and have already processed over 50k audio files successfully without the error recurring even once. Hope that helps in resolving the issue in your case as well.
@kyleboddy I think the ThreadPoolExecutor failure happens because the VAD model is loaded on the wrong device (just cuda, with no device number mentioned). Rewriting https://github.com/m-bain/whisperX/blob/e94b9043085c32c365b2b60f23e73b2d03c2241c/whisperx/asr.py#L25C11-L25C11 to
vad_model = load_vad_model(torch.device(f"{device}:{device_index}"), use_auth_token=None, **default_vad_options)
and passing the device as "cuda" and device_index as the device number helped me solve the issue.
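One thing to watch with that f-string: if device_index is a list such as [0, 1], it produces "cuda:[0, 1]", which is not a valid device string. A small hypothetical helper (not part of whisperX) that builds the string safely, assuming it is fine to put the VAD model on the first listed GPU:

```python
from typing import List, Union


def vad_device_string(device: str, device_index: Union[int, List[int]]) -> str:
    # Non-CUDA devices such as "cpu" take no index.
    if device != "cuda":
        return device
    # Assumption: when several GPUs are given, use the first one for VAD.
    if isinstance(device_index, (list, tuple)):
        index = device_index[0]
    else:
        index = device_index
    return f"{device}:{index}"
```

The result can then be passed to torch.device(...) in place of the inline f-string.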
Can you put the code into a Docker container, then launch 8 containers, passing a different GPU device index to each container? The code inside each container only sees one device. This would possibly guarantee each device is used. If you are using less memory on a particular device, you could maybe use MIG, if available, to further partition the GPUs.
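The same isolation idea can be sketched without Docker by starting one worker process per GPU and restricting what each process sees through the CUDA_VISIBLE_DEVICES environment variable. The command being launched is a stand-in; a real setup would start the transcription script instead, and would also need to split the file list between workers.

```python
import os
import subprocess
import sys
from typing import Dict, List


def worker_env(gpu_index: int) -> Dict[str, str]:
    # Copy the current environment and expose exactly one GPU to the child.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch_workers(num_gpus: int, command: List[str]) -> List[subprocess.Popen]:
    # One child process per GPU; inside each child the GPU appears as device 0.
    return [subprocess.Popen(command, env=worker_env(i)) for i in range(num_gpus)]


if __name__ == "__main__":
    # Stand-in command that only prints which GPU the worker was given.
    probe = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    for proc in launch_workers(2, probe):
        proc.wait()
```

Because each process owns its GPU outright, there is no shared CUDA state between workers, which is the property the container approach is after.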
@kyleboddy Thank you for your answer; it resolved most of my doubts. Have you ever tried creating multiple models, each with a different GPU device number, and then using multithreading for concurrency? I'm not sure if this is feasible.
I have not. I don't require that speed up as I'm running on Tesla P40s as-is; I'd switch to newer hardware first before trying to optimize multithreading. Sorry! IMO it is not possible to get a speed up using multiple GPUs.
Using device_index=[0,1] does use multiple GPUs, but it is not well optimized. For optimized multi-GPU/TPU support see Whisper-Jax. The downside is that the model is twice as large as faster-whisper's, so it may be up to 100% slower.
Hi. Sometimes I need to transcribe recorded files that are tens of seconds to a few minutes long. I used whisperX, but transcription alone takes a long time (I don't use alignment or diarization). It normally takes about half a day. I have 8 GPUs, but whisperX uses only one GPU while it is running. I found that the v3 branch supports multi-GPU and that the v3 branch has been merged into the main branch. How can I use all of the GPUs? I load the model with the above code, and I tried batch sizes 16 and 64, but whisperX always uses only one of the 8 GPUs.