marieai / marie-ai

Integrate AI-powered Document Analysis Pipelines
MIT License

CRASHES WITH CUDA ERROR: DEVICE-SIDE ASSERT TRIGGERED #114

Open · gregbugaj opened this issue 5 months ago

gregbugaj commented 5 months ago

Describe the bug

The application crashes on the GPU with a CUDA device-side assert. Sample document IDs that trigger it:

Crashes with `CUDA error: device-side assert triggered`: 204092337, 203925852, 204092927, 204092966, 204166227, 204041606, 204040160

- 205967262: EOB (medical_page_classifier)
- 208788841: CORR
- 209425570: CORR
- 209805214: CORR
- 209976466: CORR
- 211567153: CORR
- 212670800: CORR
- 213805705: CORR / ROTATED
- 213942700: CORR / ROTATED
- 214292051: CORR
- 214288815: CORR
- 214291267: CORR / ROTATED
- 214292900: CORR / LARGE
- 214894529: CORR / ENVELOPE
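Several of the failing documents in the list above are marked ROTATED or LARGE. One plausible but unconfirmed trigger: if the page classifier is a LayoutLM-style model (an assumption; the log only names `corr-classifier`), word boxes must be scaled to the 0 to 1000 range before they are embedded, and a box measured against the wrong page size (for example, width and height swapped on a rotated page) overshoots that range. A minimal sketch of defensive normalization; `normalize_bbox` and the 1000 scale are hypothetical, not marie-ai code:

```python
def normalize_bbox(box, width, height, scale=1000):
    """Scale an (x0, y0, x1, y1) pixel box to [0, scale], clamping overflow.

    Hypothetical helper: assumes the model embeds coordinates in 0..scale
    (the LayoutLM-family convention) and that `width`/`height` describe the
    image the box was actually measured on.
    """
    if width <= 0 or height <= 0:
        raise ValueError(f"invalid page size: {width}x{height}")
    x0, y0, x1, y1 = box
    scaled = (
        int(scale * x0 / width),
        int(scale * y0 / height),
        int(scale * x1 / width),
        int(scale * y1 / height),
    )
    # Clamp so a box spilling past the page edge still maps to a valid
    # embedding row instead of indexing past the end of the table.
    return tuple(min(max(v, 0), scale) for v in scaled)
```

For example, `normalize_bbox((1500, 100, 1900, 300), 1000, 2000)` models a box taken from a landscape page but normalized with portrait dimensions; without the clamp its x coordinates would land at 1500 and 1900.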

```text
INFO   marie@37 Executing pipeline for document : PID_1956_9362_0_203925852.tif, lbxid > /tmp/generators/a9de56b33b040d12568f379e0078684a
INFO   marie@37 Executing pipeline runtime_conf : {'name': 'default-corr', 'page_splitter': {'enabled': False}, 'type': 'pipeline', 'page_cleaner': {'enabled': False}, 'page_classifier': {'enabled': True}}
INFO   marie@37 Feature : page classifier enabled : True
INFO   marie@37 Feature : page indexer enabled : True
INFO   marie@37 Loaded classifiers : corr-classifier, 3
INFO   marie@37 Loaded classifiers : corr-payer-classifier, 3
INFO   marie@37 Restoring assets from s3://marie/lbxid/pid_1956_9362_0_203925852 to /tmp/generators/a9de56b33b040d12568f379e0078684a  [05/15/24 14:38:45]
INFO   marie@37 Bursting frames for PID_1956_9362_0_203925852.tif
INFO   marie@37 Processing classifier pipeline/group :  default-corr, corr-classifier
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [166,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
ERROR  marie@37 Error classifying document : CUDA error: device-side assert triggered  [05/15/24 14:38:45]
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
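The `Indexing.cu` line in the log above is the underlying failure: `indexSelectLargeIndex` asserted because an index was not smaller than the size of the dimension being indexed (`srcIndex < srcSelectDimSize`). PyTorch implements embedding lookups with `index_select`, so the most likely (unconfirmed) cause is a classifier input, such as token IDs, position IDs, or bounding-box coordinates, holding a value outside its embedding table. Checking the batch on the CPU before it is moved to the GPU turns the opaque device-side assert into a readable report. A minimal sketch; `find_out_of_range` and the feature names are hypothetical:

```python
def _flatten(values):
    """Yield the scalars of an arbitrarily nested list/tuple."""
    if isinstance(values, (list, tuple)):
        for item in values:
            yield from _flatten(item)
    else:
        yield values


def find_out_of_range(features, limits):
    """Report features whose values fall outside [0, limit).

    features: name -> nested list of ints (e.g. the output of tensor.tolist()).
    limits:   name -> embedding-table size for that feature (hypothetical
              names; read the real sizes from the model config).
    Returns name -> (min_value, max_value, limit) for each offending feature.
    """
    report = {}
    for name, limit in limits.items():
        if name not in features:
            continue  # this model does not take that input
        values = list(_flatten(features[name]))
        if not values:
            continue
        if any(not 0 <= v < limit for v in values):
            report[name] = (min(values), max(values), limit)
    return report
```

For a LayoutLMv3-style model (again an assumption about the classifier), the limits could come from `model.config.vocab_size` for `input_ids` and `model.config.max_2d_position_embeddings` for `bbox`.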

```text
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/marie/components/document_classifier/transformers.py", line 244, in predict
    for results in pipe_batched_results:
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 1067, in forward
    model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 972, in _ensure_tensor_on_device
    return UserDict({name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()})
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 972, in <dictcomp>
    return UserDict({name: self._ensure_tensor_on_device(tensor, device) for name, tensor in inputs.items()})
  File "/opt/venv/lib/python3.10/site-packages/transformers/pipelines/base.py", line 980, in _ensure_tensor_on_device
    return inputs.to(device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
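The traceback above ends in `_ensure_tensor_on_device` at `inputs.to(device)`, a host-to-device copy that does not run an index-select kernel. Because CUDA kernels execute asynchronously, that copy is most likely only the first call to observe an error raised by an earlier kernel, as the message itself warns. Following its suggestion, the sketch below sets `CUDA_LAUNCH_BLOCKING=1` so launches become synchronous and the traceback points at the failing operation. It assumes the variable is set before torch initializes CUDA; `enable_cuda_launch_blocking` is a hypothetical helper, and exporting the variable in the worker's environment before start-up is the simpler equivalent.

```python
import os
import sys


def enable_cuda_launch_blocking():
    """Make CUDA kernel launches synchronous (debugging only).

    Must run before torch initializes CUDA; the safest point is before
    `import torch` anywhere in the process. Returns False when torch is
    already imported, in which case the setting may have no effect.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return "torch" not in sys.modules
```

Synchronous launches slow inference noticeably, so this setting suits reproducing a single failing document rather than production runs.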

```text
ERROR  marie@37 Error while classifying documents: CUDA error: device-side assert triggered  [05/15/24 14:38:45]
       CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
       Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
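The second `ERROR` line shows the same exception reaching the document-level handler. Catching it there is not enough: the CUDA runtime documents a device-side assert (`cudaErrorAssert`) as leaving the device unusable until the process is terminated and relaunched, so later documents handled by the same worker should be expected to fail as well. One way to contain that, sketched under the assumption that classification can be driven as a separate command, is to run each GPU job in a child process. `run_isolated` is hypothetical, not a marie-ai API.

```python
import subprocess
import sys


def run_isolated(cmd, timeout=600):
    """Run one job in a child process; return (ok, returncode, stderr_tail).

    If the child hits a device-side assert, only its CUDA context is lost.
    The parent never initializes CUDA, so it can start a fresh child for
    the next document.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False, None, "timed out"
    # Keep only the end of stderr, where a Python traceback usually lands.
    return proc.returncode == 0, proc.returncode, proc.stderr[-2000:]


if __name__ == "__main__":
    # Placeholder job: a real deployment would call its own per-document
    # classification entry point here instead of an inline script.
    ok, code, err = run_isolated([sys.executable, "-c", "print('classified')"])
    print(ok, code)
```

Reloading the model in every child is expensive; a longer-lived child that is replaced only after it fails is a cheaper variant of the same idea.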