huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Compiling SDXL Turbo causes core dump #377

Closed satsumas closed 11 months ago

satsumas commented 11 months ago

I am following the instructions here for compiling SDXL Turbo: https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion#stable-diffusion-xl-turbo, i.e. running

optimum-cli export neuron --model stabilityai/sdxl-turbo --task stable-diffusion-xl --batch_size 1 --height 512 --width 512 --auto_cast matmul --auto_cast_type bf16 sdxl_turbo_neuron/

I'm running on an inf2.8xlarge instance, with optimum==1.15.0 and optimum-neuron==0.0.13.

I get this exception:

Validating text_encoder model...
2023-Dec-13 16:44:18.626441 190864:229706 ERROR   NRT:nrt_allocate_neuron_cores               NeuronCore(s) not available - Requested:2 Available:0
terminate called after throwing an instance of 'c10::Error'
  what():  The PyTorch Neuron Runtime could not be initialized. Neuron Driver issues are logged
to your system logs. See the Neuron Runtime's troubleshooting guide for help on this
topic: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/
Exception raised from initialize at /opt/workspace/KaenaPyTorchRuntime/neuron_op/runtime.cpp:195 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f84d50c0457 in /home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f84d508a4b5 in /home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: neuron::NeuronRuntime::initialize() + 0xd21 (0x7f84d1414d31 in /home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/torch_neuronx/lib/libtorchneuron.so)
frame #3: neuron::Model::blocking_load() + 0x1dd (0x7f84d15076ed in /home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/torch_neuronx/lib/libtorchneuron.so)
frame #4: std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::shared_ptr<neuron::NeuronModel> (neuron::Model::*)(), neuron::Model*> > >::_M_run() + 0x31 (0x7f84d150a801 in /home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/torch_neuronx/lib/libtorchneuron.so)
frame #5: <unknown function> + 0xd6df4 (0x7f852dfbfdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #6: <unknown function> + 0x8609 (0x7f859db21609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #7: clone + 0x43 (0x7f859dc5b353 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
Traceback (most recent call last):
  File "/home/ubuntu/repos/InternalForkComfyUI/inf2_venv/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/home/ubuntu/repos/InternalForkComfyUI/inf2_venv/lib/python3.8/site-packages/optimum/commands/export/neuronx.py", line 155, in run
    subprocess.run(full_command, shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m optimum.exporters.neuron --model stabilityai/sdxl-turbo --task stable-diffusion-xl --batch_size 1 --height 512 --width 512 --auto_cast matmul --auto_cast_type bf16 sdturboneuron2/' returned non-zero exit status 134.

neuron-top shows that I have two Neuron cores available, both at 0% utilisation.
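For anyone else debugging the `NeuronCore(s) not available - Requested:2 Available:0` error: besides another process already holding the cores, core allocation can be limited by runtime environment variables (`NEURON_RT_VISIBLE_CORES` and `NEURON_RT_NUM_CORES`, per the Neuron runtime configuration docs). A minimal stdlib sketch to dump the current settings:

```python
import os

# NeuronCore allocation can be restricted by runtime environment variables
# (names from the AWS Neuron runtime configuration docs). An unset value
# means the runtime default applies, i.e. all cores are visible.
settings = {
    var: os.environ.get(var, "<unset>")
    for var in ("NEURON_RT_VISIBLE_CORES", "NEURON_RT_NUM_CORES")
}
for name, value in settings.items():
    print(f"{name}={value}")
```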

JingyaHuang commented 11 months ago

Hi @satsumas! According to the error log, this looks like an environment issue. Could you send me your Neuron setup using the following commands?

apt list --installed | grep aws-neuron
pip3 list | grep -e neuron -e xla -e torch -e diffusers -e optimum

Meanwhile, you could check whether reinstalling the Neuron SDK solves the issue: details here.

satsumas commented 11 months ago

Thank you!

apt list --installed | grep aws-neuron

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

aws-neuronx-collectives/unknown,now 2.18.19.0-f7a1f7a35 amd64 [installed]
aws-neuronx-dkms/unknown,now 2.14.5.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.18.15.0-d9ebf86cc amd64 [installed]
aws-neuronx-tools/unknown,now 2.15.4.0 amd64 [installed]
pip3 list | grep -e neuron -e xla -e torch -e diffusers -e optimum
aws-neuronx-runtime-discovery 2.9
diffusers                     0.24.0
libneuronxla                  0.5.538
neuronx-cc                    2.11.0.34+c5231f848
neuronx-distributed           0.5.0
neuronx-hwm                   2.11.0.2+e34678757
optimum                       1.15.0
optimum-neuron                0.0.13
torch                         1.13.1
torch-neuronx                 1.13.1.1.12.0
torch-xla                     1.13.1+torchneuronc
torchsde                      0.2.6
torchvision                   0.14.1
transformers-neuronx          0.8.268

I set this VM up a few months ago, and although I recently repeated the setup guide in case my drivers were outdated, it's still possible that something is out of date.

I'll try uninstalling and reinstalling the SDK.

satsumas commented 11 months ago

OK, so I set up the environment again. The issue is that the compilation command from the docs doesn't seem to be compatible with the optimum-neuron version I am running:

optimum-cli export neuron --model stabilityai/sdxl-turbo --task stable-diffusion-xl --batch_size 1 --height 512 --width 512 --auto_cast matmul --auto_cast_type bf16 sdxl_turbo_neuron/ produced the following output:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 342, in <module>
    main()
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 327, in main
    main_export(
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 240, in main_export
    model.feature_extractor.save_pretrained(output.joinpath("feature_extractor"))
AttributeError: 'NoneType' object has no attribute 'save_pretrained'
Traceback (most recent call last):

The AttributeError arises when main_export accesses the feature_extractor attribute of the model: the SDXL Turbo model has a feature_extractor attribute, but it is set to None.
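A None-safe version of that check might look something like this. This is only a sketch of what I'd expect the export code to do; `model` and `output` stand in for the objects handled inside `main_export`, and `save_feature_extractor` is a hypothetical helper name:

```python
from pathlib import Path


def save_feature_extractor(model, output: Path) -> None:
    """Save the feature extractor only when the pipeline actually has one.

    SDXL Turbo defines the attribute but leaves it set to None, so the
    export code needs to guard against None, not just a missing attribute.
    """
    feature_extractor = getattr(model, "feature_extractor", None)
    if feature_extractor is not None:
        feature_extractor.save_pretrained(output / "feature_extractor")
```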

If I just comment out lines 239-240 of optimum/exporters/neuron/__main__.py to avoid that check (in case this is a bug and the correct behaviour would be to test whether the model has a non-None feature_extractor), then the model does compile. But validation then fails and ends with an error:

The maximum absolute difference between the output of the reference model and the Neuron exported model is not within the set tolerance 0.001:
- sample: max diff = 0.033275306224823
An error occured with the error message: Validation of unet fails: Unknown opcode for unpickling at 0xffffffffffffffed: 237.
 The exported model was saved at: sdxl_turbo_neuron

That same unpickling error persists when I try to use the model, as in these docs: https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion#stable-diffusion-xl-turbo

 >>> pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_turbo_neuron/", data_parallel_mode="all")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 372, in from_pretrained
    return from_pretrained_method(
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 452, in _from_pretrained
    pipe = cls.load_model(
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 266, in load_model
    torch.jit.load(unet_path),
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/jit/_serialization.py", line 162, in load
    cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Unknown opcode for unpickling at 0xffffffffffffffed: 237
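One thing worth noting for debugging: `torch.jit.save` writes a zip archive, so a truncated or corrupted artifact should be detectable without loading torch at all. A stdlib sketch, where the path to the compiled UNet is an assumption about the export layout:

```python
import zipfile
from pathlib import Path


def looks_like_valid_torchscript(path: Path) -> bool:
    """TorchScript artifacts use PyTorch's zip-based serialization format,
    so a file that is not a readable zip archive is certainly corrupt
    (a truncated write, e.g. from a full disk, is a common cause)."""
    return path.is_file() and zipfile.is_zipfile(path)


# Hypothetical location of the compiled UNet from this export run:
print(looks_like_valid_torchscript(Path("sdxl_turbo_neuron/unet/model.neuron")))
```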

One thought: there is a warning at the start of the compilation logs relating to the file that causes the AttributeError (the warning is emitted whether or not I edit that file as described above). Do you have more context on this?

/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'optimum.exporters.neuron.__main__' found in sys.modules after import of package 'optimum.exporters.neuron', but prior to execution of 'optimum.exporters.neuron.__main__'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))

After following the installation instructions for PyTorch Neuron (torch-neuronx) and optimum[neuronx] I end up with optimum-neuron==0.0.13, and I can see that there are more recent releases. I'll try to install those now.

satsumas commented 11 months ago

Building optimum-neuron from source allows the compilation to finish successfully!

But now I cannot load the model... It looks like the same problem as https://github.com/huggingface/optimum-neuron/issues/364#issuecomment-1845943905

>>> pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_turbo_neuron2/", data_parallel_mode="all")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 372, in from_pretrained
    return from_pretrained_method(
  File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 503, in _from_pretrained
    dynamic_batch_size=neuron_configs[DIFFUSION_MODEL_UNET_NAME].dynamic_batch_size,
KeyError: 'unet'

Pulling the pre-compiled binary from Jingya/sdxl-turbo-neuronx works though.

JingyaHuang commented 11 months ago

Hi @satsumas, in the folder sdxl_turbo_neuron2/ you have locally, do you have the compiled unet model along with its config file under the subfolder unet?
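A quick stdlib check for this, where the folder names are the submodels I'd expect an SDXL pipeline export to contain (an assumption about the default layout, not a guaranteed contract):

```python
from pathlib import Path
from typing import List

# Submodel folders expected in an SDXL pipeline export (assumed layout).
EXPECTED_SUBMODELS = [
    "text_encoder",
    "text_encoder_2",
    "unet",
    "vae_encoder",
    "vae_decoder",
]


def missing_submodels(export_dir: Path) -> List[str]:
    """Return the expected submodel folders that are absent or lack a config."""
    return [
        name
        for name in EXPECTED_SUBMODELS
        if not (export_dir / name / "config.json").is_file()
    ]


print(missing_submodels(Path("sdxl_turbo_neuron2")))
```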

JingyaHuang commented 11 months ago

In the issue you mentioned, Matt actually bumped into it because of a disk-space issue on his machine, which caused the compilation of the UNet to fail.
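If disk space might be the culprit, free space on the drive holding the export can be checked with the stdlib (the compiled SDXL artifacts can easily run to several GB):

```python
import shutil

# Free space on the filesystem containing the current directory.
usage = shutil.disk_usage(".")
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```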

satsumas commented 11 months ago

Aha, no, I don't have a compiled unet, but I did notice I was running out of disk space! Thank you -- I will look into that.