huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite
Apache License 2.0

Exporter being killed #69

Open willswire opened 10 months ago

willswire commented 10 months ago

Similar to #61, my exporter process is being killed. I'd like to verify this is a resource constraint, and not an issue in the project. I am running python3 -m exporters.coreml --model=mistralai/Mistral-7B-v0.1 mistral.mlpackage on an M3 MacBook Pro with 18GB of memory.

model-00001-of-00002.safetensors: 100%|████| 9.94G/9.94G [07:47<00:00, 21.3MB/s]
model-00002-of-00002.safetensors: 100%|████| 4.54G/4.54G [04:42<00:00, 16.1MB/s]
Downloading shards: 100%|████| 2/2 [12:31<00:00, 375.71s/it]
Loading checkpoint shards: 100%|████| 2/2 [00:25<00:00, 12.58s/it]
Using framework PyTorch: 2.1.0
Overriding 1 configuration item(s)
    - use_cache -> False
/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_attn_mask_utils.py:114: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
/opt/homebrew/lib/python3.11/site-packages/transformers/modeling_attn_mask_utils.py:161: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if past_key_values_length > 0:
/opt/homebrew/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:119: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_len > self.max_seq_len_cached:
/opt/homebrew/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:285: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
/opt/homebrew/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:292: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
/opt/homebrew/lib/python3.11/site-packages/transformers/models/mistral/modeling_mistral.py:304: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
Skipping token_type_ids input
Patching PyTorch conversion 'log' with <function MistralCoreMLConfig.patch_pytorch_ops.<locals>.log at 0x13a115300>
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.__contains__ is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
  warnings.warn(msg, category=FutureWarning)
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.__getitem__ is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
  warnings.warn(msg, category=FutureWarning)
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.__delitem__ is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
  warnings.warn(msg, category=FutureWarning)
/opt/homebrew/lib/python3.11/site-packages/coremltools/models/_deprecation.py:27: FutureWarning: Function _TORCH_OPS_REGISTRY.__setitem__ is deprecated and will be removed in 7.2.; Please use coremltools.converters.mil.frontend.torch.register_torch_op
  warnings.warn(msg, category=FutureWarning)
Converting PyTorch Frontend ==> MIL Ops:   0%|    | 0/4506 [00:00<?, ? ops/s]
Saving value type of int64 into a builtin type of int32, might lose precision!
Saving value type of int64 into a builtin type of int32, might lose precision!
Converting PyTorch Frontend ==> MIL Ops: 100%|████| 4505/4506 [00:01<00:00, 3255.50 ops/s]
Running MIL frontend_pytorch pipeline: 100%|████| 5/5 [00:00<00:00, 13.02 passes/s]
Running MIL default pipeline:  14%|████| 10/71 [00:00<00:03, 15.93 passes/s]
/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/mil/passes/defs/preprocess.py:267: UserWarning: Output, '5409', of the source model, has been renamed to 'var_5409' in the Core ML model.
  warnings.warn(msg.format(var.name, new_name))
Running MIL default pipeline:  73%|████| 52/71 [03:36<02:09,  6.79s/ passes]
/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:894: RuntimeWarning: overflow encountered in cast
  return input_var.val.astype(dtype=string_to_nptype(dtype_val))
/opt/homebrew/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py:896: RuntimeWarning: overflow encountered in cast
  return np.array(input_var.val).astype(dtype=string_to_nptype(dtype_val))
Running MIL default pipeline: 100%|████| 71/71 [07:27<00:00,  6.30s/ passes]
Running MIL backend_mlprogram pipeline: 100%|████| 12/12 [00:00<00:00, 168.96 passes/s]
zsh: killed     python3 -m exporters.coreml --model=mistralai/Mistral-7B-v0.1 
willwalker misty >
/opt/homebrew/Cellar/python@3.11/3.11.7/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
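
A minimal sketch (not from the original report; it assumes psutil is installed) for checking whether this is in fact a memory kill is to wrap the same CLI invocation and watch the exporter's peak resident set size:

# Sketch: poll the exporter's RSS (plus any child processes) once a second
# and report the peak before it exits or is killed.
import time
import psutil

cmd = [
    "python3", "-m", "exporters.coreml",
    "--model=mistralai/Mistral-7B-v0.1",
    "mistral.mlpackage",
]

proc = psutil.Popen(cmd)
peak_rss = 0

while proc.poll() is None:
    try:
        procs = [proc] + proc.children(recursive=True)
        peak_rss = max(peak_rss, sum(p.memory_info().rss for p in procs if p.is_running()))
    except psutil.NoSuchProcess:
        pass
    time.sleep(1)

print("exit code:", proc.returncode)  # -9 means the process was killed with SIGKILL
print(f"peak RSS observed: {peak_rss / 2**30:.1f} GiB")
print(f"available memory now: {psutil.virtual_memory().available / 2**30:.1f} GiB")

If the exit code is -9 and the peak RSS approaches physical memory plus swap, the kernel killed the process for memory, which would match the zsh: killed line above.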
jnicolaes commented 10 months ago

I have the same issue with a MacBook M3 Max with 96GB of RAM.

Proryanator commented 8 months ago

Same thing for me as well, M3 Max w/ 36GB of RAM 🤔 I'm wondering if this is a macOS thing or a zsh thing; going to try disabling macOS's automatic termination of processes: https://osxdaily.com/2012/05/15/disable-automatic-termination-of-apps-in-mac-os-x/

RE: that didn't work, still got killed. I ran it with sudo and was able to get a bit farther, it seems; right after the Running MIL backend_mlprogram pipeline step I got this:

Restoring PyTorch conversion op 'log' to <function log at 0x13f499f80>.

It looks like a process called ANECompiler is the one running when the exporters script seems to hang; not sure if that's useful information.

Gonna keep trying things, maybe even debug the script if I have time.

[Screenshot 2024-03-28 at 7:41:27 PM]
Proryanator commented 8 months ago

I tried cpu_only, cpu_and_gpu, and cpu_and_ne, and it still got killed. I was hoping it was specific to trying to optimize it for the NE (based on the top process in the screenshot), but apparently not.

I am also converting a Mistral-based model, SynthIA 7B; maybe this is unique to those models?

Can try a different type to see.

So far I've tried:

I've also tried the version of exporters from when Mistral support was added; same thing, it just hangs and zsh kills the process.
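
For reference, a sketch (not from this thread; it uses a tiny stand-in model so it stays runnable) of where compute units are selected when calling coremltools directly:

# Sketch only: compute-unit selection in coremltools, shown on a tiny
# stand-in model instead of a 7B checkpoint.
import coremltools as ct
import torch

traced = torch.jit.trace(torch.nn.Linear(8, 8).eval(), torch.randn(1, 8))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 8))],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_ONLY,  # or CPU_AND_GPU, CPU_AND_NE, ALL
)
mlmodel.save("tiny.mlpackage")

# The unit can also be re-selected when loading an existing package:
loaded = ct.models.MLModel("tiny.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_NE)

Note that compute units mainly affect where the model runs once loaded; the conversion passes themselves still hold the full weights in memory, which would be consistent with the kills happening regardless of the unit chosen.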

Proryanator commented 8 months ago

I've tried:

Python 3.10.13 -> suggested here; did not work, crashed way earlier. I also tried using a lower version of tqdm as suggested in the same thread.

I closed all other windows and just had the terminal open; I got this other error sometimes, but most of the time it still gets killed:

RuntimeError: [MIL FileWriter]: Unknown error occured while writing data to the file.

Progress? 😆

Proryanator commented 8 months ago

I tried the same SynthIA model conversion, same setup, on an M2 Max, and the conversion at least worked.

Validation failed, but I wonder if this is specific to the M3* family of chips.

This also happens with coremltools directly, even when trying to quantize a relatively small model. Pretty sure it has nothing to do with this repo 😬
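
For context, "quantizing with coremltools directly" would look roughly like the sketch below; the .mlpackage path is a placeholder and the calls are from coremltools 7's optimize module, not from exporters:

# Sketch: weight-quantizing an existing Core ML package with coremltools'
# optimize API. This also loads the whole model, so it is subject to the
# same memory limits as the export itself.
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("model.mlpackage")   # placeholder path

config = cto.OptimizationConfig(
    global_config=cto.OpLinearQuantizerConfig(mode="linear_symmetric")
)
quantized = cto.linear_quantize_weights(mlmodel, config=config)
quantized.save("model-w8.mlpackage")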

Proryanator commented 4 months ago

This still appears to be an issue; I re-ran a conversion some time later. I can confirm that my M3 Max is using swap correctly, but it seems to be that same "There appear to be 1 leaked semaphore objects to clean up at shutdown" error.

norbit8 commented 3 months ago

> I tried the same SynthIA model conversion, same setup, on an M2 Max, and the conversion at least worked.
>
> Validation failed, but I wonder if this is specific to the M3* family of chips.
>
> This also happens with coremltools directly, even when trying to quantize a relatively small model. Pretty sure it has nothing to do with this repo 😬

Getting the same error when using Python 3.12 and running on an Intel-based Mac 😖

Proryanator commented 3 months ago

@norbit8 I realized it has more to do with the model getting loaded into memory right after it has been converted. If you want, you could pull my branch of exporters (linked just above your comment), run the pip install command in there, and try again :)
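
As an illustration of that general idea only (this is not necessarily what the branch above does): coremltools documents a skip_model_load flag that avoids compiling and loading the freshly converted model, shown here on a tiny stand-in model:

# Sketch: skip_model_load=True tells coremltools not to compile/load the
# converted model, so no predictions are possible on the returned object,
# but it can still be saved to disk.
import coremltools as ct
import torch

traced = torch.jit.trace(torch.nn.ReLU(), torch.randn(1, 8))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=(1, 8))],
    convert_to="mlprogram",
    skip_model_load=True,   # don't load the converted model into memory
)
mlmodel.save("relu.mlpackage")  # saving still works without the loaded runtime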