marieai / marie-ai

Integrate AI-powered Document Analysis Pipelines

Cuda Error: Out of memory #27

Closed · gregbugaj closed this issue 1 year ago

gregbugaj commented 2 years ago

This needs to be handled better: the CUDA out-of-memory error is raised deep in the TrOCR inference path, and the `logger.error` call that reports it fails as well (note the `--- Logging error ---` block below). A sketch of a possible mitigation follows the traceback.

 File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 701, in preprocess
    ocr_results, frames = obtain_ocr(src_image, self.text_executor)
  File "/opt/marie-icr/marie/executor/ner/ner_extraction_executor.py", line 78, in obtain_ocr
    results = text_executor.extract(docs, **kwa)
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 369, in extract
    logger.error("Extract error", error)
Message: 'Extract error'
Arguments: (RuntimeError('CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 358.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'),)
--- Logging error ---
Traceback (most recent call last):
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 346, in extract
    results = self.__process_extract_fullpage(
  File "/opt/marie-icr/marie/executor/text_extraction_executor.py", line 164, in __process_extract_fullpage
    result, overlay_image = self.icr_processor.recognize(
  File "/opt/marie-icr/marie/document/icr_processor.py", line 250, in recognize
    raise ex
  File "/opt/marie-icr/marie/document/icr_processor.py", line 119, in recognize
    results = self.recognize_from_fragments(fragments)
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 251, in recognize_from_fragments
    raise ex
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 232, in recognize_from_fragments
    predictions, scores = get_text(
  File "/opt/marie-icr/marie/document/trocr_icr_processor.py", line 122, in get_text
    results = task.inference_step(

2022-08-15 09:34:20,276 DEBG 'wsgi-app' stdout output:
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/tasks/fairseq_task.py", line 542, in inference_step
    return generator.generate(
  File "/opt/venv/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 204, in generate
    return self._generate(sample, **kwargs)
  File "/opt/marie-icr/marie/models/unilm/trocr/generator.py", line 144, in _generate
    lprobs, avg_attn_scores = self.model.forward_decoder(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/sequence_generator.py", line 819, in forward_decoder
    decoder_out = model.decoder.forward(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 217, in forward
    x, extra = self.extract_features(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 239, in extract_features
    return self.extract_features_scriptable(
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/models/transformer/transformer_decoder.py", line 340, in extract_features_scriptable
    x, layer_attn, _ = layer(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/transformer_layer.py", line 487, in forward
    x, attn = self.encoder_attn(
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/fairseq-0.12.2-py3.8-linux-x86_64.egg/fairseq/modules/multihead_attention.py", line 593, in forward
    k = self.k_proj(key)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/venv/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA out of memory. Tried to allocate 362.00 MiB (GPU 0; 47.54 GiB total capacity; 38.29 GiB already allocated; 356.94 MiB free; 44.88 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
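Two things could be improved here. First, the logging call itself is broken: `logger.error("Extract error", error)` passes the exception as a printf-style argument with no `%s` placeholder, which is what produces the `--- Logging error ---` block above; `logger.error("Extract error: %s", error)` or `logger.exception("Extract error")` would log it correctly. Second, the OOM could be caught and degraded gracefully instead of killing the request. A minimal sketch, assuming a hypothetical `run_inference(fragments, batch_size=...)` wrapper around the TrOCR call (not the actual marie-ai API):

```python
# Hedged sketch: run_inference and its batch_size parameter are hypothetical,
# standing in for the recognize_from_fragments path in the traceback above.
import logging
import torch

logger = logging.getLogger(__name__)

def recognize_with_oom_fallback(run_inference, fragments, min_batch=1):
    """Retry inference with a smaller batch on CUDA OOM instead of crashing."""
    batch = len(fragments)
    while True:
        try:
            return run_inference(fragments, batch_size=batch)
        except RuntimeError as ex:
            # In this PyTorch version an OOM surfaces as a RuntimeError whose
            # message starts with "CUDA out of memory".
            if "CUDA out of memory" not in str(ex) or batch <= min_batch:
                logger.exception("Extract error")  # logs message + traceback
                raise
            torch.cuda.empty_cache()  # release cached blocks before retrying
            batch = max(min_batch, batch // 2)
            logger.warning("CUDA OOM, retrying with batch_size=%d", batch)
```

Independently, the fragmentation hint in the error message itself can be tried by setting the allocator config before the process starts, e.g. `export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`.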
create-issue-branch[bot] commented 2 years ago

Branch issue-27-Cuda_Error_Out_of_memory created!

gregbugaj commented 1 year ago

Another instance occurs when the overlay is being processed.

I suspect this is an issue with TorchVision; a possible workaround is sketched after the traceback.

```
Creating overlay for : segment > /tmp/segment.png
dst_file_name : /tmp/form-segmentation/segment/dataroot_overlay/overlay_segment.png
opt.preprocess = none
dataset [SingleDataset] was created
__extract_segmentation_mask in 0.19 seconds
Segmented in 0.24 seconds
Traceback (most recent call last):
  File "/home/greg/environment/marie/lib/python3.10/site-packages/gradio/routes.py", line 298, in run_predict
    output = await app.blocks.process_api(
  File "/home/greg/environment/marie/lib/python3.10/site-packages/gradio/blocks.py", line 790, in process_api
    result = await self.call_function(fn_index, inputs, iterator)
  File "/home/greg/environment/marie/lib/python3.10/site-packages/gradio/blocks.py", line 697, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/greg/environment/marie/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/greg/environment/marie/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/greg/environment/marie/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/greg/dev/marieai/marie-ai/workspaces/overlay-gradio/./app.py", line 16, in process_image
    real, fake, blended = overlay_processor.segment(docId, src_img_path)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/greg/dev/marieai/marie-ai/marie/overlay/overlay.py", line 248, in segment
    fake_mask = self.__extract_segmentation_mask(
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/greg/dev/marieai/marie-ai/marie/overlay/overlay.py", line 129, in __extract_segmentation_mask
    for i, data in enumerate(dataset):
  File "/home/greg/dev/marieai/marie-ai/marie/models/pix2pix/data/__init__.py", line 101, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 635, in __next__
    data = self._next_data()
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 679, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/greg/dev/marieai/marie-ai/marie/models/pix2pix/data/single_dataset.py", line 54, in __getitem__
    A = self.transform(tensor_image.cuda())
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    def forward(self, input):
        for module in self:
            input = module(input)
                    ~~~~~~ <--- HERE
        return input
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torchvision/transforms/transforms.py", line 270, in forward
            Tensor: Normalized Tensor image.
        """
        return F.normalize(tensor, self.mean, self.std, self.inplace)
               ~~~~~~~~~~~ <--- HERE
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torchvision/transforms/functional.py", line 363, in normalize
        raise TypeError(f"img should be Tensor Image. Got {type(tensor)}")

    return F_t.normalize(tensor, mean=mean, std=std, inplace=inplace)
           ~~~~~~~~~~~~~ <--- HERE
  File "/home/greg/environment/marie/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py", line 911, in normalize

    if not inplace:
        tensor = tensor.clone()
                 ~~~~~~~~~~~~ <--- HERE

    dtype = tensor.dtype
RuntimeError: CUDA error: operation failed due to a previous error during capture
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
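The failure originates in `SingleDataset.__getitem__`, which calls `tensor_image.cuda()` and then runs a scripted `Normalize` on the GPU from inside the DataLoader fetch path; the "previous error during capture" wording suggests an earlier CUDA failure left the context in a bad state, so the `clone()` inside `normalize` is likely a victim rather than the root cause. A minimal sketch of a workaround, keeping the dataset on the CPU and moving data to the GPU only after the DataLoader yields it (the class below is illustrative, not the real pix2pix `SingleDataset`):

```python
# Illustrative sketch; the real marie-ai SingleDataset and model differ.
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

class CpuSingleDataset(Dataset):
    def __init__(self, images):
        self.images = images  # assumed: list of CHW float tensors in [0, 1]
        # Plain (non-scripted) transform, applied on the CPU.
        self.transform = transforms.Normalize(mean=[0.5, 0.5, 0.5],
                                              std=[0.5, 0.5, 0.5])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # No .cuda() here: DataLoader fetch code should never touch the GPU.
        return self.transform(self.images[idx])

def segment(model, dataset, device="cuda"):
    loader = DataLoader(dataset, batch_size=1, num_workers=0)
    for batch in loader:
        # Transfer once per batch, in the main process, after fetching.
        batch = batch.to(device, non_blocking=True)
        with torch.no_grad():
            yield model(batch)
```

For debugging, running with `CUDA_LAUNCH_BLOCKING=1`, as the error itself suggests, makes kernel failures report synchronously and usually points at the real first error.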