google-ai-edge / ai-edge-torch

Supporting PyTorch models with the Google AI Edge TFLite runtime.
Apache License 2.0

Not able to convert Llama 3.2 1B Instruct to TFLite format #269

Open atultiwari opened 1 day ago

atultiwari commented 1 day ago

Description of the bug:

I am using Google Colab Pro+ (with a High-RAM runtime) to convert the Llama 3.2 1B Instruct model to TFLite format (for later use in a MediaPipe Android app). To do so:

  1. I downloaded the safetensors file from the unsloth Hugging Face repo (link).
  2. I updated the convert script with the path to the downloaded safetensors file; the conversion step is sketched after this list.
  3. I worked around a torch-xla related issue by installing the following wheel: `!pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.4/torch_xla-2.4.0-cp310-cp310-linux_x86_64.whl`
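
For reference, the conversion step boils down to roughly the sketch below. The builder name and keyword arguments are my paraphrase of the generative example scripts and may differ across ai-edge-torch versions; only `converter.convert_to_tflite` is confirmed by the traceback further down, and the paths are placeholders.

```python
from ai_edge_torch.generative.examples.llama import llama
from ai_edge_torch.generative.utilities import converter

# Placeholder path to the downloaded safetensors checkpoint.
CHECKPOINT_PATH = "/content/llama-3.2-1b-instruct"

# Build the PyTorch model from the checkpoint. The builder name is my
# paraphrase of the example module's build_* helper and may differ by version.
pytorch_model = llama.build_1b_model(CHECKPOINT_PATH, kv_cache_max_len=1024)

# Convert (and quantize) to a .tflite flatbuffer. This is the call that
# appears in the traceback below; keyword names may vary by version.
converter.convert_to_tflite(
    pytorch_model,
    tflite_path="/content/llama_3_2_1b_q8.tflite",
    prefill_seq_len=512,
    quantize=True,
)
```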

However, I am now getting the error `flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes`.
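
From what I can tell, this limit comes from the Python FlatBuffers builder: offsets are 32-bit, so a single serialized buffer cannot exceed 2 GiB. A rough size check (parameter count approximate) suggests why a ~1B-parameter model trips the limit whenever float32 weight buffers end up packed into the flatbuffer:

```python
# Back-of-envelope size check (parameter count is approximate).
params = 1.24e9                  # ~1.24B parameters in Llama 3.2 1B
limit_gib = 2.0                  # FlatBuffers Python builder cap: 2 GiB

fp32_gib = params * 4 / 2**30    # ~4.6 GiB as float32 weights
int8_gib = params * 1 / 2**30    # ~1.2 GiB with 8-bit weights

print(f"fp32: {fp32_gib:.1f} GiB, over limit: {fp32_gib > limit_gib}")
print(f"int8: {int8_gib:.1f} GiB, over limit: {int8_gib > limit_gib}")
```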

If it helps - link to the Google Colab notebook.

Actual vs expected behavior:

Expected behavior: the convert script produces a .tflite file for the 1B model.

Actual behavior: conversion fails with the flatbuffers BuilderSizeError above.

Any other information you'd like to share?

Error Log

```
/content/ai-edge-torch/ai_edge_torch/generative/examples/llama
2024-09-29 19:20:17.359533: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-29 19:20:17.377103: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1727637617.398465 5173 cuda_dnn.cc:8312] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1727637617.405059 5173 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-29 19:20:17.426425: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:202: UserWarning: `tensorflow` can conflict with `torch-xla`. Prefer `tensorflow-cpu` when using PyTorch/XLA. To silence this warning, `pip uninstall -y tensorflow && pip install tensorflow-cpu`. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
W0929 19:22:15.024696 133865848435328 runtime.py:42] PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
W0929 19:22:15.024884 133865848435328 runtime.py:59] Defaulting to PJRT_DEVICE=CPU
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727637735.028522 5173 cpu_client.cc:467] TfrtCpuClient created.
2024-09-29 19:22:37.086796: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
I0000 00:00:1727637757.086944 5173 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38554 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
I0929 19:22:41.535187 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:41.651360 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:42.652768 133865848435328 functional_saver.py:440] Sharding callback duration: 67
I0929 19:22:46.771306 133865848435328 functional_saver.py:440] Sharding callback duration: 105
INFO:tensorflow:Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.322300 133865848435328 builder_impl.py:836] Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.358078 133865848435328 fingerprinting_utils.py:49] Writing fingerprint to /tmp/tmphr1fv8ev/fingerprint.pb
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1727637787.839623 5173 tf_tfl_flatbuffer_helpers.cc:365] Ignored output_format.
W0000 00:00:1727637787.839659 5173 tf_tfl_flatbuffer_helpers.cc:368] Ignored drop_control_dependency.
2024-09-29 19:23:07.840485: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmphr1fv8ev
2024-09-29 19:23:07.847678: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2024-09-29 19:23:07.847723: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: /tmp/tmphr1fv8ev
I0000 00:00:1727637787.889566 5173 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-09-29 19:23:07.895510: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
2024-09-29 19:23:10.512704: I tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: /tmp/tmphr1fv8ev
2024-09-29 19:23:10.591254: I tensorflow/cc/saved_model/loader.cc:466] SavedModel load for tags { serve }; Status: success: OK. Took 2750774 microseconds.
2024-09-29 19:23:10.649508: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-09-29 19:32:25.463338: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:3893] Estimated count of arithmetic ops: 2586.261 G ops, equivalently 1293.130 G MACs
Traceback (most recent call last):
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 68, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 59, in main
    converter.convert_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/generative/utilities/converter.py", line 62, in convert_to_tflite
    ai_edge_torch.signature(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/converter.py", line 163, in convert
    return conversion.convert_signatures(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/conversion.py", line 105, in convert_signatures
    tflite_model = lowertools.exported_programs_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/_shim.py", line 75, in exported_programs_to_tflite
    return utils.merged_bundle_to_tfl_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/torch_xla_utils.py", line 280, in merged_bundle_to_tfl_model
    tflite_model = translate_recipe.quantize_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/translate_recipe.py", line 162, in quantize_model
    result = qt.quantize()
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 243, in quantize
    quantized_model = self._get_quantized_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 331, in _get_quantized_model
    return model_modifier_instance.modify_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 85, in modify_model
    return self._serialize_small_model(quantized_model)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 178, in _serialize_small_model
    model_bytearray = flatbuffer_utils.convert_object_to_bytearray(
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/tools/flatbuffer_utils.py", line 122, in convert_object_to_bytearray
    model_offset = model_object.Pack(builder)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 18390, in Pack
    bufferslist.append(self.buffers[i].Pack(builder))
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 17650, in Pack
    data = builder.CreateNumpyVector(self.data)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 503, in CreateNumpyVector
    self.StartVector(x.itemsize, x.size, x.dtype.alignment)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 400, in StartVector
    self.Prep(N.Uint32Flags.bytewidth, elemSize*numElems)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 354, in Prep
    self.growByteBuffer()
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 303, in growByteBuffer
    raise BuilderSizeError(msg)
flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes
I0000 00:00:1727638470.142136 5173 cpu_client.cc:470] TfrtCpuClient destroyed.
```

pkgoogle commented 4 hours ago

Hi @atultiwari, can you gain access to the "official" weights? https://huggingface.co/meta-llama/Llama-3.2-3B ... I'm not fully sure what the differences between the official and unsloth versions are, but we will likely run into fewer issues with this route. There's also a specific convert_3b_to_tflite.py script now available; perhaps that will resolve your issue: llama example
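
Roughly, the flow would look like this (a sketch: the token is a placeholder, and you'd need to request access on the model page first):

```python
from huggingface_hub import snapshot_download

# Gated repo: accept the Llama 3.2 license on the model page first,
# then authenticate with your own Hugging Face access token.
ckpt_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-3B",
    token="hf_...",  # placeholder; use your own token
)

# Then point the example script at the downloaded checkpoint, e.g.
# (flag spelling may vary; check the script's flags):
#   python ai_edge_torch/generative/examples/llama/convert_3b_to_tflite.py \
#       --checkpoint_path <ckpt_dir>
print(ckpt_dir)
```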