huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

optimum-cli neuron consolidate -> OOM. Using only one Neuron Core #579

Open MarcoBFreitas opened 2 months ago

MarcoBFreitas commented 2 months ago

System Info

optimum-neuron == 0.0.21
optimum == 1.18.0
Using trn1.32xlarge with the Hugging Face Neuron Deep Learning AMI

Who can help?

@michaelbenayoun


Reproduction (minimal, reproducible, runnable)

Reproduced by following the tutorial at https://huggingface.co/docs/optimum-neuron/en/tutorials/fine_tune_llama_7b.

Then, from the instance terminal, run:

```bash
optimum-cli neuron consolidate dolly_llama/tensor_parallel_shards dolly_llama
```

This fails with the following output:

```
Consolidating checkpoints from dolly_llama to the safetensors format...
2024-Apr-24 11:17:47.196792  8629:9806  ERROR  TDRV:dmem_alloc_internal                     Failed to alloc DEVICE memory: 22544384
2024-Apr-24 11:17:47.203397  8629:9806  ERROR  TDRV:dml_dump                                Wrote nrt memory alloc debug info to /tmp/nrt_mem_log_device_0_6628ea5b.csv
2024-Apr-24 11:17:47.209035  8629:9806  ERROR  TDRV:log_dev_mem                             Failed to allocate 21.500MB (usage: tensors) on ND 0:NC 0, current utilization:
        * total: 15.964GB
        * tensors: 15.964GB
        * runtime: 1.062KB
        * dma rings: 32.000KB

2024-Apr-24 11:17:47.224842  8629:9806  ERROR  TDRV:tensor_allocate                         Failed to allocate 22544384 bytes on DEVICE for tensor UNKNOWN.
2024-04-24 11:17:47.255676: E tensorflow/compiler/xla/stream_executor/stream.cc:2523] RESOURCE_EXHAUSTED: tensorflow/libtpu/neuron/nrt_adaptor.cc:157:Not enough device memory on Neuron core 0 for allocating size=22544384: nrt_tensor_allocate: status=4, error message="Failed to allocate a resource for requested operation".
2024-04-24 11:17:47.262571: F tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:476] Non-OK-status: session->session()->Run(session_work->feed_inputs, session_work->outputs_handles, &outputs) status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED: tensorflow/libtpu/neuron/nrt_adaptor.cc:157:Not enough device memory on Neuron core 0 for allocating size=22544384: nrt_tensor_allocate: status=4, error message="Failed to allocate a resource for requested operation".
         [[{{node _send_XRTAllocateFromTensor_3_0}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED: tensorflow/libtpu/neuron/nrt_adaptor.cc:157:Not enough device memory on Neuron core 0 for allocating size=22544384: nrt_tensor_allocate: status=4, error message="Failed to allocate a resource for requested operation".
0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  RESOURCE_EXHAUSTED: tensorflow/libtpu/neuron/nrt_adaptor.cc:157:Not enough device memory on Neuron core 0 for allocating size=22544384: nrt_tensor_allocate: status=4, error message="Failed to allocate a resource for requested operation".
*** Begin stack trace ***
        tsl::CurrentStackTrace()

        xla::util::MultiWait::Complete(std::function<void ()> const&)

        clone
*** End stack trace ***
```

Expected behavior

Expected successful consolidation of the shards into the safetensors format.
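
For context, consolidation is conceptually a host-side merge: load every tensor-parallel shard on the CPU, stitch the split parameters back together, and write one safetensors file, so it should not need device memory at all. The log above shows NeuronCore 0 holding 15.964 GB of tensors, essentially the full 16 GB a single core has on trn1, which matches the title's observation that only one core is used. A minimal sketch of the host-side idea (the shard file pattern and the single concatenation axis are illustrative assumptions, not optimum-neuron's actual checkpoint layout):

```python
# Hypothetical sketch of a host-side consolidation: merge tensor-parallel
# shards on the CPU and write a single safetensors file. The shard file
# pattern and the concatenation axis are illustrative assumptions, not
# optimum-neuron's real checkpoint layout.
import glob

import torch
from safetensors.torch import save_file

shard_paths = sorted(glob.glob("dolly_llama/tensor_parallel_shards/tp_rank_*.pt"))

# map_location="cpu" keeps every tensor on the host, so no Neuron core
# memory is allocated during the merge.
shards = [torch.load(p, map_location="cpu") for p in shard_paths]

consolidated = {}
for name in shards[0]:
    tensors = [shard[name] for shard in shards]
    # Simplification: assume every sharded parameter was split along dim 0.
    consolidated[name] = torch.cat(tensors, dim=0)

save_file(consolidated, "dolly_llama/model.safetensors")
```

In reality each parallel layer is split along a specific axis (and some parameters are replicated rather than sharded), so a real consolidation needs per-parameter metadata; the sketch only illustrates that the merge can stay on the host, whereas the log shows tensors being allocated on a Neuron core instead.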

MarcoBFreitas commented 2 months ago

I initially shared the wrong environment info; the correct one is:

Platform:

- Platform: Linux-5.15.0-1053-aws-x86_64-with-glibc2.29
- Python version: 3.8.10

Python packages:

- `optimum-neuron` version: 0.0.22.dev0
- `neuron-sdk` version: 2.18.0
- `optimum` version: 1.18.1
- `transformers` version: 4.36.2
- `huggingface_hub` version: 0.20.3
- `torch` version: 1.13.1+cu117
- `aws-neuronx-runtime-discovery` version: 2.9
- `libneuronxla` version: 0.5.971
- `neuronx-cc` version: 2.13.66.0+6dfecc895
- `neuronx-distributed` version: 0.7.0
- `neuronx-hwm` version: 2.12.0.0+422c9037c
- `torch-neuronx` version: 1.13.1.1.14.0
- `torch-xla` version: 1.13.1+torchneurone
- `transformers-neuronx` version: 0.10.0.21

Neuron Driver:


```
aws-neuronx-collectives/unknown,now 2.20.11.0-c101c322e amd64 [installed,upgradable to: 2.20.22.0-c101c322e]
aws-neuronx-dkms/unknown,now 2.15.9.0 amd64 [installed,upgradable to: 2.16.7.0]
aws-neuronx-oci-hook/unknown,now 2.2.45.0 amd64 [installed,upgradable to: 2.3.0.0]
aws-neuronx-runtime-lib/unknown,now 2.20.22.0-1b3ca6425 amd64 [installed]
aws-neuronx-tools/unknown,now 2.17.0.0 amd64 [installed,upgradable to: 2.17.1.0]
```

jianyinglangaws commented 1 month ago

Can you work around this issue by adding the flag `--use_xser=False` to the run command?
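
For reference, `use_xser` appears to refer to torch-xla's incremental checkpoint serialization (`torch_xla.utils.serialization`), which writes each tensor to its own file instead of one monolithic pickle, giving xser checkpoints a different on-disk layout and load path than plain `torch.save` output. A hedged sketch of the two formats (the link between the flag and this module is my reading of the thread, not verified against the training script):

```python
# Sketch contrasting a plain torch checkpoint with torch-xla's "xser"
# serialization, which the suggested --use_xser flag presumably toggles
# (an assumption based on this thread, not verified in the training code).
# Intended to run inside an XLA-initialized process on a Neuron instance.
import torch
import torch_xla.utils.serialization as xser

state = {"weight": torch.randn(4, 4)}

# Plain checkpoint: a single file, loadable entirely on the CPU.
torch.save(state, "shard.pt")
cpu_state = torch.load("shard.pt", map_location="cpu")

# xser checkpoint: a small reference file plus per-tensor data files,
# written and read incrementally to bound host memory usage.
xser.save(state, "shard_xser.pt")
restored = xser.load("shard_xser.pt")
```

If the shards under `dolly_llama/tensor_parallel_shards` were written in the xser format, the consolidate step has to go through that alternate load path, which may be where the single-core device allocations come from; it is also worth confirming where the flag belongs, since consolidation reads checkpoints that the training command wrote.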