huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite
Apache License 2.0
572 stars 35 forks

Converting llama-2-7b failed #61

Closed TimYao18 closed 8 months ago

TimYao18 commented 8 months ago

It runs well until UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown. Then I checked Activity Monitor and saw that the Python process had stopped running.

How can I fix this?

(LLM_env) tim@TPE exporters % python -m exporters.coreml --model=/Users/tim/GitLab/survey/LLM/llama-meta/Llama-2-7b-hf exported/ 
Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:02<00:00, 31.31s/it]
Using framework PyTorch: 2.0.1
Overriding 1 configuration item(s)
        - use_cache -> False
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:808: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1:
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:146: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:375: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:382: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:392: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Skipping token_type_ids input
Converting PyTorch Frontend ==> MIL Ops:   0%|                                                                                                                                                     | 0/3627 [00:00<?, ? ops/s]Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 3626/3627 [00:01<00:00, 3155.13 ops/s]
Running MIL frontend_pytorch pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 18.10 passes/s]
Running MIL default pipeline:  15%|██████████████████████▋                                                                                                                               | 10/66 [00:01<00:05, 10.96 passes/s]/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:267: UserWarning: Output, '4530', of the source model, has been renamed to 'var_4530' in the Core ML model.
  warnings.warn(msg.format(var.name, new_name))
Running MIL default pipeline:  77%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                                  | 51/66 [03:23<01:40,  6.70s/ passes]/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:894: RuntimeWarning: overflow encountered in cast
  return input_var.val.astype(dtype=string_to_nptype(dtype_val))
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:896: RuntimeWarning: overflow encountered in cast
  return np.array(input_var.val).astype(dtype=string_to_nptype(dtype_val))
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 66/66 [09:44<00:00,  8.86s/ passes]
Running MIL backend_mlprogram pipeline: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 65.90 passes/s]
zsh: killed     python -m exporters.coreml  exported/
(LLM_env) tim@TPE exporters % /Users/tim/.pyenv/versions/3.11.5/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
pcuenca commented 8 months ago

Hi @TimYao18, that message is just a warning. Did you kill the process yourself?

TimYao18 commented 8 months ago

It hung there for over 300 minutes, and Activity Monitor showed no Python process running, so I stopped it.

I will run it again today and see if it runs well.

TimYao18 commented 8 months ago

This time it crashed. The only change I made to the source code is:

if __name__ == "__main__":
    logger = logging.get_logger("exporters.coreml")  # pylint: disable=invalid-name
    logger.setLevel(logging.DEBUG) # <--- replace INFO with DEBUG
    main()

The output:

Torch version 2.0.1 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading checkpoint shards: 100%|██████████████████| 2/2 [01:03<00:00, 31.93s/it]
Using framework PyTorch: 2.0.1
Overriding 1 configuration item(s)
    - use_cache -> False
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:808: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_shape[-1] > 1:
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:146: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:375: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:382: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py:392: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Skipping token_type_ids input
Converting PyTorch Frontend ==> MIL Ops:   0%|       | 0/3627 [00:00<?, ? ops/s]Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 3626/3627 [00:01<00:00, 3294.98
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 17.99 passes/s]
Running MIL default pipeline:  15%|█▏      | 10/66 [00:01<00:05, 11.00 passes/s]/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:267: UserWarning: Output, '4530', of the source model, has been renamed to 'var_4530' in the Core ML model.
  warnings.warn(msg.format(var.name, new_name))
Running MIL default pipeline:  77%|██████▏ | 51/66 [03:10<01:40,  6.68s/ passes]/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:894: RuntimeWarning: overflow encountered in cast
  return input_var.val.astype(dtype=string_to_nptype(dtype_val))
/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:896: RuntimeWarning: overflow encountered in cast
...
  File "/Users/tim/GitLab/survey/LLM/LLM_env/lib/python3.11/site-packages/coremltools/converters/mil/backend/mil/helper.py", line 308, in _get_offset_by_writing_data
    offset = blob_writer.write_fp16_data(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [MIL FileWriter]: Unknown error occured while writing data to the file.
TimYao18 commented 8 months ago

I tried 3 times on my MacBook Air M1 with different log options and compute units but hit the same error. Could you share the Python version you use for reference? I think it might be a Python issue.

I also tried it on a MacBook Air M2; the run ended as shown in the attached log: log.txt

The last line is Running MIL backend_mlprogram pipeline: 100%|█| 12/12 [00:00<00:00, 103.12 passe. It seems to end inside ct.convert(), because I had added a print() right after that call and it never appeared.
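For reference, the debug print I added is roughly like the sketch below (simplified, not the actual exporters code; the tiny stand-in module only keeps the example self-contained, while in convert.py the real traced LLaMA model is passed in):

import numpy as np
import torch
import coremltools as ct

# Stand-in for the traced LLaMA model (the real one comes from torch.jit.trace
# inside exporters' convert.py); a tiny module keeps this sketch runnable.
class Tiny(torch.nn.Module):
    def forward(self, input_ids):
        return input_ids.float().mean(dim=-1)

traced_model = torch.jit.trace(Tiny(), torch.zeros(1, 128, dtype=torch.int32))

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    convert_to="mlprogram",
)
print("ct.convert() returned")  # with the real 7B model this line never printed
mlmodel.save("Tiny.mlpackage")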

Environment:
- MacBook Air M2: Python 3.9.6, coremltools 7.0, macOS Sonoma 14.0
- MacBook Air M1: Python 3.11.5, coremltools 7.0, macOS Sonoma 14.0

TimYao18 commented 8 months ago

I tried LLaMA-2-7b-chat and the conversion completed. I think there might be something wrong with LLaMA-2-7b.

pcuenca commented 8 months ago

I tried LLaMA-2-7b-chat and the conversion completed. I think there might be something wrong with LLaMA-2-7b.

Ohh, that's a good hint; I usually test conversion of chat/instruct models. I'll verify both and let you know how it goes.

Can I ask what's your use case for converting the base model rather than the chat version?

TimYao18 commented 8 months ago

I don't know the difference between the base model and the chat model. I only know that the chat model is the fine-tuned LLaMA model, but when I use them they don't seem to behave differently.

TimYao18 commented 8 months ago

Today I ran it again with no luck; the conversion never seems to finish at ct.convert() in convert.py.

I guess it might be a memory issue, since my MacBook only has 16 GB of RAM?

pcuenca commented 8 months ago

Hi @TimYao18!

I've been testing and I've been able to convert both the chat and the base models. RAM could be an issue, yes: I saw > 50 GB RAM usage during conversion. This doesn't mean that you won't be able to convert it yourself, but your system may swap and progress could become extremely slow.

Another reason that could explain failures is disk space. During conversion, there are lots of temporary files and caches written to disk that take up a lot of space. Please, ensure you have plenty of free storage before attempting conversion of these models.
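If it's useful, a quick pre-flight check along the lines of the sketch below (not part of exporters; the thresholds are ballpark figures from my run, not hard requirements) can tell you whether RAM or disk space is likely to be the limiting factor before starting a multi-hour conversion:

import os
import shutil

# Ballpark requirements observed while converting Llama-2-7b (not hard limits).
NEEDED_RAM_GB = 50    # peak RAM usage seen during conversion
NEEDED_DISK_GB = 100  # temporary files, caches and the exported .mlpackage

# SC_PHYS_PAGES is available on Linux and macOS; other platforms may need psutil.
total_ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
free_disk_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9

print(f"Physical RAM: {total_ram_gb:.1f} GB "
      f"(>{NEEDED_RAM_GB} GB recommended, otherwise expect heavy swapping)")
print(f"Free disk:    {free_disk_gb:.1f} GB (>{NEEDED_DISK_GB} GB recommended)")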

If you want to use a pre-converted model in your app, you can test this model that I converted myself.

TimYao18 commented 8 months ago

I switched to a Windows machine with 32 GB of RAM, but that is still not enough and the process gets killed. So I tried the quantized model Llama-2-7b-Chat-GPTQ, but it raises the errors below. Do the errors mean I have to modify the quantized model rather than the exporters source code?

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/timya/Downloads/exporters-main/src/exporters/coreml/__main__.py", line 178, in <module>
    main()
  File "/mnt/c/Users/timya/Downloads/exporters-main/src/exporters/coreml/__main__.py", line 138, in main
    model = FeaturesManager.get_model_from_feature(
  File "/mnt/c/Users/timya/Downloads/exporters-main/src/exporters/coreml/features.py", line 477, in get_model_from_feature
    model = model_class.from_pretrained(model, cache_dir=cache_dir, torchscript=True)
  File "/mnt/c/Users/timya/Downloads/llama-main/llama_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/c/Users/timya/Downloads/llama-main/llama_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3155, in from_pretrained
    model = quantizer.convert_model(model)
  File "/mnt/c/Users/timya/Downloads/llama-main/llama_env/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 171, in convert_model
    self.block_name_to_quantize = get_block_name_with_pattern(model)
  File "/mnt/c/Users/timya/Downloads/llama-main/llama_env/lib/python3.10/site-packages/optimum/gptq/utils.py", line 77, in get_block_name_with_pattern
    raise ValueError("Block pattern could not be match. Pass `block_name_to_quantize` argument in `quantize_model`")
ValueError: Block pattern could not be match. Pass `block_name_to_quantize` argument in `quantize_model`
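For reference, one workaround I'm considering (unverified; the block name "model.layers" and the checkpoint path are my assumptions, and the exporters CLI would still need a small patch to pass the config through from_pretrained) is to give optimum the block name explicitly via GPTQConfig:

from transformers import AutoModelForCausalLM, GPTQConfig

# Unverified sketch: tell optimum's GPTQ loader which attribute holds the
# transformer blocks ("model.layers" for LLaMA-style architectures).
quant_config = GPTQConfig(bits=4, block_name_to_quantize="model.layers")

model = AutoModelForCausalLM.from_pretrained(
    "Llama-2-7b-Chat-GPTQ",    # local path / repo id of the quantized checkpoint
    quantization_config=quant_config,
    torchscript=True,          # exporters loads models with torchscript=True
)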
TimYao18 commented 8 months ago

I converted the model on Windows with 64 GB of RAM and it worked very well. Thank you.

HaoFYu commented 4 months ago

I switched to a Windows machine with 32 GB of RAM, but that is still not enough and the process gets killed. So I tried the quantized model Llama-2-7b-Chat-GPTQ, but it raises the errors below. Do the errors mean I have to modify the quantized model rather than the exporters source code?

ValueError: Block pattern could not be match. Pass `block_name_to_quantize` argument in `quantize_model`

I had the same mistake. How did you fix it?