google-ai-edge / ai-edge-torch

Supporting PyTorch models with the Google AI Edge TFLite runtime.
Apache License 2.0
310 stars · 40 forks

Phi3 conversion OOM on A100 #44

Open a8nova opened 3 months ago

a8nova commented 3 months ago

Description of the bug:

I wanted to convert Phi-3. I made the necessary changes in my own fork (https://github.com/google-ai-edge/ai-edge-torch/compare/main...a8nova:ai-edge-torch:phi3), but the OOM killer is killing my conversion process.

Full error attached: phi3_conversion_error.txt

Actual vs expected behavior:

The OOM killer is terminating the conversion script on a Colab A100 instance, instead of the conversion completing successfully.

Any other information you'd like to share?

  1. Is there anything wrong in the Phi-3 re-authoring? All changes can be viewed here: https://github.com/google-ai-edge/ai-edge-torch/compare/main...a8nova:ai-edge-torch:phi3
  2. Is there anything I can do to get it to convert (e.g., changing parameters to make the conversion more memory-efficient)?
  3. Any debugging tips?
haozha111 commented 3 months ago

Hi @a8nova thanks for reporting the issue!

There is a known issue with high memory usage during the conversion process, which may kill the conversion script. Which Phi-3 version are you converting, and what is the size of the Phi-3 checkpoint you are using? A free Colab instance may only have 12 GB of RAM, which isn't enough. Do you happen to have: 1) a Colab Pro subscription, or 2) a Linux workstation (local or on cloud) with over 50 GB of memory?

We are still actively working on fixing the memory issue, and sorry for the inconvenience!

haozha111 commented 3 months ago

Also, from the conversion log, it seems the memory consumption is from CUDA. Are you able to try `CUDA_VISIBLE_DEVICES=-1` to disable GPU memory allocation? The conversion only needs to consume CPU memory.
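Setting the variable inside the conversion script itself can be sketched like this (a minimal example; in a Colab notebook `%env CUDA_VISIBLE_DEVICES=-1` before any imports achieves the same thing):

```python
import os

# Hide all CUDA devices so TensorFlow/PyTorch fall back to CPU-only
# allocation. This must be set before those frameworks are imported,
# otherwise they may have already initialized CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Imports of torch / ai_edge_torch and the rest of the conversion
# script would follow here, after the variable is in place.
```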

a8nova commented 3 months ago

Hi @haozha111 - Thank you for the quick response.

Let me try setting CUDA_VISIBLE_DEVICES.

a8nova commented 3 months ago

I am also getting an OOM when running with CUDA_VISIBLE_DEVICES=-1 on a box with 53 GB of system RAM:

env: CUDA_VISIBLE_DEVICES=-1
/content/ai-edge-torch/ai_edge_torch/generative/examples/phi3
2024-06-10 20:14:26.133314: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-06-10 20:14:26.577974: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-10 20:14:28.834951: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:Defaulting to PJRT_DEVICE=CPU
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1718050475.549140   17434 cpu_client.cc:424] TfrtCpuClient created.
WARNING:root:Your model "prefill" is converted in training mode. Please set the module in evaluation mode with `module.eval()` for better on-device performance and compatibility.
WARNING:root:Your model "decode" is converted in training mode. Please set the module in evaluation mode with `module.eval()` for better on-device performance and compatibility.
2024-06-10 20:18:36.252751: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-06-10 20:18:36.252876: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:135] retrieving CUDA diagnostic information for host: 055cf236c060
2024-06-10 20:18:36.252891: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:142] hostname: 055cf236c060
2024-06-10 20:18:36.253133: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:166] libcuda reported version is: 535.104.5
2024-06-10 20:18:36.253165: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:170] kernel reported version is: 535.104.5
2024-06-10 20:18:36.253176: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:249] kernel version seems to match DSO: 535.104.5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1718051375.007065   17434 tf_tfl_flatbuffer_helpers.cc:392] Ignored output_format.
W0000 00:00:1718051375.010046   17434 tf_tfl_flatbuffer_helpers.cc:395] Ignored drop_control_dependency.
2024-06-10 20:29:35.016643: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmpil1idwz1
2024-06-10 20:29:35.028233: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2024-06-10 20:29:35.028277: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: /tmp/tmpil1idwz1
2024-06-10 20:29:35.126021: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:388] MLIR V1 optimization pass is not enabled
2024-06-10 20:29:35.139828: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
^C
Colab resource readout (Colab Pro, Python 3 Google Compute Engine backend with GPU), 10:20 PM to 11:31 PM:

System RAM: 1.5 / 53.0 GB
GPU RAM: 0.0 / 22.5 GB
Disk: 68.1 / 201.2 GB
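As a side note, the `prefill`/`decode` warnings in the log above ("converted in training mode") can usually be silenced by switching the modules to evaluation mode before conversion. A minimal sketch with a stand-in `nn.Linear` module (the real re-authored Phi-3 model would come from the fork's conversion script):

```python
import torch
import torch.nn as nn

# Stand-in for the re-authored model; the actual Phi-3 modules come
# from the conversion script in the fork.
model = nn.Linear(4, 4)

# Switch to evaluation mode before export/conversion. This disables
# dropout and similar training-only behavior and avoids the
# "converted in training mode" warnings.
model.eval()

# Run sample inputs without building autograd state.
with torch.no_grad():
    out = model(torch.zeros(1, 4))
```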
haozha111 commented 3 months ago

Got it. Do you mind updating your branch with the Phi-3 changes? Then we can fork it and try converting. Thanks!

a8nova commented 3 months ago

Changes in the phi3 branch are up to date; you should be able to check out the branch and run the conversion script. Note that I also had to make changes to loader.py and feed_forward.py. Please let me know if you run into any issues. Thank you!

a8nova commented 3 months ago

Hi @haozha111 @vamsimanchala - Any updates on this? Thanks!

haozha111 commented 3 months ago

Hi @a8nova, we are making good progress on this issue; it requires some fixes in our converter stack. We plan to give an update in the coming weeks. Thanks for your patience!

mitsunami commented 3 months ago

Hi, I am also encountering the same issue. Although I cannot share the model details, it appears to be getting killed at the same point as seen in the logs above. I am looking forward to a fix for this issue. Thanks!

haozha111 commented 3 months ago

Hi @mitsunami,

Are you trying to convert from a Colab Pro instance or a local Linux workstation, and how much memory do you have?

We are making great progress on reducing the converter's memory usage and will give an update on this issue soon. Thanks for your patience!

mitsunami commented 3 months ago

Hi @haozha111, I'm trying that on a local desktop with 64 GB RAM. Looking forward to an update. Thanks!

vamsimanchala commented 3 months ago

Hi @mitsunami, we recently landed some changes. Can you please try the TFLite conversion again and let us know if things look good?

Thank you for your patience, Vamsi Manchala