google-ai-edge / ai-edge-torch

Supporting PyTorch models with the Google AI Edge TFLite runtime.
Apache License 2.0

Not able to convert Llama 3.2 1B Instruct to TFLite format #269

Open atultiwari opened 1 day ago

atultiwari commented 1 day ago

Description of the bug:

I am using Google Colab Pro+ (with a High-RAM runtime) to convert the Llama 3.2 1B Instruct model to TFLite format (for later use in a MediaPipe Android app). To do so:

  1. I downloaded the safetensors file from the unsloth Hugging Face repo (link).
  2. I updated the convert script with the path to the downloaded safetensors file; the conversion step is sketched after this list.
  3. I worked around a torch-xla related issue by installing the following wheel: `!pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.4/torch_xla-2.4.0-cp310-cp310-linux_x86_64.whl`
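
For reference, the conversion step boils down to roughly the sketch below. The builder name and keyword arguments are my paraphrase of the generative example scripts and may differ across ai-edge-torch versions; only `converter.convert_to_tflite` is confirmed by the traceback further down, and the paths are placeholders.

```python
from ai_edge_torch.generative.examples.llama import llama
from ai_edge_torch.generative.utilities import converter

# Placeholder path to the downloaded safetensors checkpoint.
CHECKPOINT_PATH = "/content/llama-3.2-1b-instruct"

# Build the PyTorch model from the checkpoint. The builder name is my
# paraphrase of the example module's build_* helper and may differ by version.
pytorch_model = llama.build_1b_model(CHECKPOINT_PATH, kv_cache_max_len=1024)

# Convert (and quantize) to a .tflite flatbuffer. This is the call that
# appears in the traceback below; keyword names may vary by version.
converter.convert_to_tflite(
    pytorch_model,
    tflite_path="/content/llama_3_2_1b_q8.tflite",
    prefill_seq_len=512,
    quantize=True,
)
```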

However, I am now getting the error `flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes`.
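
From what I can tell, this limit comes from the Python FlatBuffers builder: offsets are 32-bit, so a single serialized buffer cannot exceed 2 GiB. A rough size check (parameter count approximate) suggests why a ~1B-parameter model trips the limit whenever float32 weight buffers end up packed into the flatbuffer:

```python
# Back-of-envelope size check (parameter count is approximate).
params = 1.24e9                  # ~1.24B parameters in Llama 3.2 1B
limit_gib = 2.0                  # FlatBuffers Python builder cap: 2 GiB

fp32_gib = params * 4 / 2**30    # ~4.6 GiB as float32 weights
int8_gib = params * 1 / 2**30    # ~1.2 GiB with 8-bit weights

print(f"fp32: {fp32_gib:.1f} GiB, over limit: {fp32_gib > limit_gib}")
print(f"int8: {int8_gib:.1f} GiB, over limit: {int8_gib > limit_gib}")
```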

If it helps - link to the Google Colab notebook.

Actual vs expected behavior:

Expected behavior: the convert script produces a .tflite file for the 1B model.

Actual behavior: conversion fails with the flatbuffers BuilderSizeError above.

Any other information you'd like to share?

Error Log

```
/content/ai-edge-torch/ai_edge_torch/generative/examples/llama
2024-09-29 19:20:17.359533: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-09-29 19:20:17.377103: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1727637617.398465 5173 cuda_dnn.cc:8312] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1727637617.405059 5173 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-29 19:20:17.426425: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:202: UserWarning: `tensorflow` can conflict with `torch-xla`. Prefer `tensorflow-cpu` when using PyTorch/XLA. To silence this warning, `pip uninstall -y tensorflow && pip install tensorflow-cpu`. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
W0929 19:22:15.024696 133865848435328 runtime.py:42] PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
W0929 19:22:15.024884 133865848435328 runtime.py:59] Defaulting to PJRT_DEVICE=CPU
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727637735.028522 5173 cpu_client.cc:467] TfrtCpuClient created.
2024-09-29 19:22:37.086796: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
I0000 00:00:1727637757.086944 5173 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38554 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
I0929 19:22:41.535187 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:41.651360 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:42.652768 133865848435328 functional_saver.py:440] Sharding callback duration: 67
I0929 19:22:46.771306 133865848435328 functional_saver.py:440] Sharding callback duration: 105
INFO:tensorflow:Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.322300 133865848435328 builder_impl.py:836] Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.358078 133865848435328 fingerprinting_utils.py:49] Writing fingerprint to /tmp/tmphr1fv8ev/fingerprint.pb
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1727637787.839623 5173 tf_tfl_flatbuffer_helpers.cc:365] Ignored output_format.
W0000 00:00:1727637787.839659 5173 tf_tfl_flatbuffer_helpers.cc:368] Ignored drop_control_dependency.
2024-09-29 19:23:07.840485: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmphr1fv8ev
2024-09-29 19:23:07.847678: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2024-09-29 19:23:07.847723: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: /tmp/tmphr1fv8ev
I0000 00:00:1727637787.889566 5173 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-09-29 19:23:07.895510: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
2024-09-29 19:23:10.512704: I tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: /tmp/tmphr1fv8ev
2024-09-29 19:23:10.591254: I tensorflow/cc/saved_model/loader.cc:466] SavedModel load for tags { serve }; Status: success: OK. Took 2750774 microseconds.
2024-09-29 19:23:10.649508: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-09-29 19:32:25.463338: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:3893] Estimated count of arithmetic ops: 2586.261 G ops, equivalently 1293.130 G MACs
Traceback (most recent call last):
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 68, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 59, in main
    converter.convert_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/generative/utilities/converter.py", line 62, in convert_to_tflite
    ai_edge_torch.signature(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/converter.py", line 163, in convert
    return conversion.convert_signatures(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/conversion.py", line 105, in convert_signatures
    tflite_model = lowertools.exported_programs_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/_shim.py", line 75, in exported_programs_to_tflite
    return utils.merged_bundle_to_tfl_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/torch_xla_utils.py", line 280, in merged_bundle_to_tfl_model
    tflite_model = translate_recipe.quantize_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/translate_recipe.py", line 162, in quantize_model
    result = qt.quantize()
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 243, in quantize
    quantized_model = self._get_quantized_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 331, in _get_quantized_model
    return model_modifier_instance.modify_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 85, in modify_model
    return self._serialize_small_model(quantized_model)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 178, in _serialize_small_model
    model_bytearray = flatbuffer_utils.convert_object_to_bytearray(
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/tools/flatbuffer_utils.py", line 122, in convert_object_to_bytearray
    model_offset = model_object.Pack(builder)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 18390, in Pack
    bufferslist.append(self.buffers[i].Pack(builder))
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 17650, in Pack
    data = builder.CreateNumpyVector(self.data)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 503, in CreateNumpyVector
    self.StartVector(x.itemsize, x.size, x.dtype.alignment)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 400, in StartVector
    self.Prep(N.Uint32Flags.bytewidth, elemSize*numElems)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 354, in Prep
    self.growByteBuffer()
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 303, in growByteBuffer
    raise BuilderSizeError(msg)
flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes
I0000 00:00:1727638470.142136 5173 cpu_client.cc:470] TfrtCpuClient destroyed.
```

pkgoogle commented 4 hours ago

Hi @atultiwari, can you gain access to the "official" weights? https://huggingface.co/meta-llama/Llama-3.2-3B ... I'm not fully sure what the differences between the official and unsloth versions are, but we will likely run into fewer issues with this route. There's also a specific convert_3b_to_tflite.py script now available; perhaps that will resolve your issue: llama example
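
Roughly, the flow would look like this (a sketch: the token is a placeholder, and you'd need to request access on the model page first):

```python
from huggingface_hub import snapshot_download

# Gated repo: accept the Llama 3.2 license on the model page first,
# then authenticate with your own Hugging Face access token.
ckpt_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.2-3B",
    token="hf_...",  # placeholder; use your own token
)

# Then point the example script at the downloaded checkpoint, e.g.
# (flag spelling may vary; check the script's flags):
#   python ai_edge_torch/generative/examples/llama/convert_3b_to_tflite.py \
#       --checkpoint_path <ckpt_dir>
print(ckpt_dir)
```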