google-ai-edge / ai-edge-torch

Supporting PyTorch models with the Google AI Edge TFLite runtime.

OOM Error in Gemma 2 2B TFLite Conversion with Quantization on 80GB RAM #192

Open KennethanCeyer opened 1 month ago

KennethanCeyer commented 1 month ago

I’m encountering out-of-memory (OOM) errors when attempting to convert Gemma 2 2B IT using the AI Edge Torch conversion process on a Google Colab A100 runtime with approximately 80GB of RAM. The memory figure here refers to system RAM, not GPU memory (I understand that AI Edge Torch does not currently use the GPU).

Despite following the MediaPipe user guidelines, which recommend TensorFlow Lite (TFLite) conversion, there’s no clear documentation specifying the hardware required for the conversion to run without errors.

Although that document lists a minimum requirement of 32GB, I’m hitting OOM even in an 80GB environment while converting the smallest Gemma 2 model (2B parameters).

This has made it challenging to complete the conversion process. I’d appreciate any insights into the recommended hardware specs for converting Gemma 2 2B IT, especially regarding memory and GPU requirements.

Current Colab Setup:

Colab link: https://colab.research.google.com/drive/19h3SZBiWuGqqtddHbF5MzOFGbPahXPIv?usp=sharing

AI Edge Conversion Code: https://github.com/KennethanCeyer/gemma2-to-tflite/blob/main/convert.py
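
For context, the conversion roughly follows the pattern below. This is a simplified sketch of what convert.py does, assuming the ai_edge_torch generative example API; the checkpoint path, sequence length, and the gemma2.build_2b_model builder name are illustrative, not exact.

```python
import torch
import ai_edge_torch
from ai_edge_torch.generative.examples.gemma import gemma2
from ai_edge_torch.generative.quantize import quant_recipes

# Build the PyTorch Gemma 2 2B model from a local checkpoint (path is illustrative).
pytorch_model = gemma2.build_2b_model("/content/gemma-2-2b-it", kv_cache_max_len=1024)

# Sample prefill inputs: token ids and their positions.
tokens = torch.zeros((1, 1024), dtype=torch.int)
input_pos = torch.arange(0, 1024, dtype=torch.int)

# Dynamic int8 quantization recipe; quantization is the step suspected of causing the OOM.
quant_config = quant_recipes.full_int8_dynamic_recipe()

edge_model = ai_edge_torch.convert(
    pytorch_model, (tokens, input_pos), quant_config=quant_config
)
edge_model.export("/content/gemma2_2b_it.tflite")
```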

haozha111 commented 1 month ago

The OOM probably happens at the quantization step. We've made significant memory reductions in the converter, but we recently found that quantization takes another huge chunk of memory. As a first step, can you try removing quant_config from the conversion step and see if you can get a float TFLite model without issues? Thanks!
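
Concretely, removing quantization would look roughly like this (a sketch that reuses the illustrative names from the snippet above):

```python
# Float (non-quantized) conversion: simply omit quant_config.
edge_model = ai_edge_torch.convert(pytorch_model, (tokens, input_pos))
edge_model.export("/content/gemma2_2b_it_float.tflite")
```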

KennethanCeyer commented 1 month ago

@haozha111

I’ll share the memory usage results with quantization disabled, as you suggested above. Since the goal of converting to TFLite is Edge ML serving, we’ll still need a solution for the OOM in the quantization step if that turns out to be the cause. (I’ll update the title accordingly once we reach that point.)

If the issue is confirmed to be quantization-related, it would help to know whether it should be addressed within this project through a PR or tackled separately in a dedicated PyTorch quantization project. Before that, I’ll leave a comment once we confirm whether the OOM is resolved with quantization disabled.

Thank you for your prompt response.

KennethanCeyer commented 1 month ago

After removing the quantization option, the conversion completed successfully without being killed by an OOM error. The process used up to 58GB of memory at its peak, but the memory was properly released after the conversion.

It’s clear that quantization had an impact, but the conversion process itself still consumes a significant amount of memory. So I believe reducing overall memory usage and OOM during TFLite conversion and quantization remains important, especially for edge serving, which is the original topic of this issue.


About Memory Usage

Memory seems to increase during the signature handling and bundling process, and when weights are updated during the Save and Load steps of the TFLite conversion. This might be due to objects not being garbage-collected (GC) promptly; more detailed debugging is needed to confirm.
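
One rough way to narrow this down would be to sample the process RSS around each stage and force a collection afterwards, e.g. with psutil (a debugging sketch only, reusing the illustrative names from the conversion snippet above):

```python
import gc
import psutil

_proc = psutil.Process()

def log_rss(stage: str) -> None:
    """Print the current resident set size (in GB) for a named stage."""
    print(f"[{stage}] RSS = {_proc.memory_info().rss / 1024**3:.1f} GB")

log_rss("before convert")
edge_model = ai_edge_torch.convert(pytorch_model, (tokens, input_pos))
log_rss("after convert")

edge_model.export("/content/gemma2_2b_it_float.tflite")
gc.collect()  # check whether memory is actually released after export
log_rss("after export + gc")
```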


[Screenshot attached: memory usage during conversion]


haozha111 commented 1 month ago

Thanks for sharing the latest info.

For the 50+GB memory usage inside the TFLite converter step, which TF version are you currently using? If you are installing the dependencies based on https://github.com/google-ai-edge/ai-edge-torch/blob/84c501503eea48129be9a8b369c5f1f5b6e89e00/requirements.txt#L9 (the 0722 nightly), the converter memory fixes are not fully included in that TF version. You probably need to update tf-nightly to a build from September or later and see if that helps reduce the memory usage. Let me know if the converter memory drops after this change.
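
After updating, a quick sanity check that the newer build is actually the one being imported (assuming a standard tf-nightly install; nothing ai-edge-torch specific here):

```python
import tensorflow as tf

# A September-or-later nightly should report something like 2.18.0.devYYYYMMDD.
print(tf.__version__)
```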

KennethanCeyer commented 1 month ago

The device used for testing follows the dependency versions specified in the requirements.txt file of the google-ai-edge/ai-edge-torch repository.

[Screenshot attached: installed dependency versions]

Today, I ran the tests under the same conditions (with quantization disabled) using an updated tf-nightly>=2.18.0.dev20240905 package. Unfortunately, this did not reduce memory usage.

Instead, the peak memory usage increased to 83.1GB.

It seems that further detailed profiling and optimization of the memory usage during these stages will be necessary moving forward.

[Screenshot attached: memory usage with tf-nightly 2.18.0.dev20240905]


haozha111 commented 1 month ago

Thanks for the info. We will do more analysis here and get back to you.