Closed: satsumas closed this issue 11 months ago
Hi @satsumas! According to the error log, it seems to be an issue of the environment, could you send me your neuron setup with the following command?
apt list --installed | grep aws-neuron
pip3 list | grep -e neuron -e xla -e torch -e diffusers -e optimum
Meanwhile, you could check if reinstalling the neuron SDK could solve the issue: details here.
Thank you!
apt list --installed | grep aws-neuron
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
aws-neuronx-collectives/unknown,now 2.18.19.0-f7a1f7a35 amd64 [installed]
aws-neuronx-dkms/unknown,now 2.14.5.0 amd64 [installed]
aws-neuronx-runtime-lib/unknown,now 2.18.15.0-d9ebf86cc amd64 [installed]
aws-neuronx-tools/unknown,now 2.15.4.0 amd64 [installed]
pip3 list | grep -e neuron -e xla -e torch -e diffusers -e optimum
aws-neuronx-runtime-discovery 2.9
diffusers 0.24.0
libneuronxla 0.5.538
neuronx-cc 2.11.0.34+c5231f848
neuronx-distributed 0.5.0
neuronx-hwm 2.11.0.2+e34678757
optimum 1.15.0
optimum-neuron 0.0.13
torch 1.13.1
torch-neuronx 1.13.1.1.12.0
torch-xla 1.13.1+torchneuronc
torchsde 0.2.6
torchvision 0.14.1
transformers-neuronx 0.8.268
I set this VM up a few months ago, and although I recently repeated the setup guide in case my drivers were outdated, it's still possible that something is out of date.
I'll try uninstalling and reinstalling the SDK.
OK, so I re-set up the environment. The issue is that the compilation command from the docs doesn't seem compatible with the version of optimum-neuron I am running:
optimum-cli export neuron --model stabilityai/sdxl-turbo --task stable-diffusion-xl --batch_size 1 --height 512 --width 512 --auto_cast matmul --auto_cast_type bf16 sdxl_turbo_neuron/
produced the following output:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 342, in <module>
main()
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 327, in main
main_export(
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/exporters/neuron/__main__.py", line 240, in main_export
model.feature_extractor.save_pretrained(output.joinpath("feature_extractor"))
AttributeError: 'NoneType' object has no attribute 'save_pretrained'
The AttributeError arises when main checks the feature_extractor attribute of the model. The SDXL Turbo model has a feature_extractor attribute, but it is set to None.
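To illustrate the failure mode, here is a minimal sketch of a guard that would avoid it. FakePipeline and save_feature_extractor are illustrative names I made up, not part of optimum-neuron:

```python
from pathlib import Path


class FakePipeline:
    """Stand-in for the SDXL Turbo pipeline: the attribute exists but is None."""
    feature_extractor = None


def save_feature_extractor(model, output: Path) -> bool:
    """Save the feature extractor only when one is actually present."""
    extractor = getattr(model, "feature_extractor", None)
    if extractor is None:
        # Calling extractor.save_pretrained(...) unconditionally here is
        # exactly what raises:
        #   AttributeError: 'NoneType' object has no attribute 'save_pretrained'
        return False
    extractor.save_pretrained(output / "feature_extractor")
    return True
```

With a non-None check like this, exporting a pipeline whose feature_extractor is None would simply skip the save step instead of crashing.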
If I just comment out lines 239-240 of optimum/exporters/neuron/__main__.py to avoid that check (in case this is a bug and the correct behaviour would be to test whether the model has a non-None feature_extractor), then the model does compile. But validation then ends with an error:
The maximum absolute difference between the output of the reference model and the Neuron exported model is not within the set tolerance 0.001:
- sample: max diff = 0.033275306224823
An error occured with the error message: Validation of unet fails: Unknown opcode for unpickling at 0xffffffffffffffed: 237.
The exported model was saved at: sdxl_turbo_neuron
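For context, the tolerance check in that log is essentially a max-absolute-difference comparison between the reference and Neuron outputs. A minimal sketch (my own reconstruction, not optimum's actual code, assuming numpy is available):

```python
import numpy as np


def within_tolerance(reference, exported, atol=1e-3):
    """Compare two model outputs; return (ok, max_diff)."""
    max_diff = float(np.max(np.abs(
        np.asarray(reference, dtype=np.float64)
        - np.asarray(exported, dtype=np.float64))))
    return max_diff <= atol, max_diff


# A diff of ~0.033, as reported above, fails the default 0.001 tolerance:
ok, diff = within_tolerance([0.0, 1.0], [0.0, 1.0333])
```

Given the --auto_cast_type bf16 cast, a max diff of ~0.03 may just be numerical drift from the reduced precision; the unpickling error is the more serious failure.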
That same unpickling error persists when I try to use the model, as in these docs: https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion#stable-diffusion-xl-turbo
>>> pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_turbo_neuron/", data_parallel_mode="all")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 372, in from_pretrained
return from_pretrained_method(
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 452, in _from_pretrained
pipe = cls.load_model(
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 266, in load_model
torch.jit.load(unet_path),
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/torch/jit/_serialization.py", line 162, in load
cpp_module = torch._C.import_ir_module(cu, str(f), map_location, _extra_files)
RuntimeError: Unknown opcode for unpickling at 0xffffffffffffffed: 237
One thought: there is a warning at the start of the compilation logs, relating to the file causing the AttributeError (this warning is emitted whether or not I edit that file in the way described above). Do you have more context for this?
/usr/lib/python3.8/runpy.py:127: RuntimeWarning: 'optimum.exporters.neuron.__main__' found in sys.modules after import of package 'optimum.exporters.neuron', but prior to execution of 'optimum.exporters.neuron.__main__'; this may result in unpredictable behaviour
warn(RuntimeWarning(msg))
After following the installation instructions for torch-neuronx and optimum[neuronx] I end up with optimum-neuron==0.0.13, and I can see that there are more recent releases. I'll try installing those now.
Building optimum-neuron from source allows the compilation to finish successfully!
But now I cannot load the model... It seems like the same problem as https://github.com/huggingface/optimum-neuron/issues/364#issuecomment-1845943905
>>> pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_turbo_neuron2/", data_parallel_mode="all")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/modeling_base.py", line 372, in from_pretrained
return from_pretrained_method(
File "/home/ubuntu/venvs/aws_neuron_venv_pytorch/lib/python3.8/site-packages/optimum/neuron/modeling_diffusion.py", line 503, in _from_pretrained
dynamic_batch_size=neuron_configs[DIFFUSION_MODEL_UNET_NAME].dynamic_batch_size,
KeyError: 'unet'
Pulling the pre-compiled binary from Jingya/sdxl-turbo-neuronx works though.
Hi @satsumas, in the folder sdxl_turbo_neuron2/ that you have locally, do you have the compiled unet model along with its config file under the subfolder unet?
In the issue you mentioned, Matt actually ran into it because of a disk-space problem on his machine, which caused the compilation of the UNet to fail.
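A quick way to verify this before loading is a small helper like the one below. has_compiled_unet is a hypothetical name, not part of optimum-neuron; it only checks that the unet/ subfolder exists and is non-empty, without assuming specific file names:

```python
from pathlib import Path


def has_compiled_unet(model_dir) -> bool:
    """Return True if the export directory has a non-empty unet/ subfolder."""
    unet = Path(model_dir) / "unet"
    return unet.is_dir() and any(unet.iterdir())
```

If this returns False, from_pretrained will hit the KeyError: 'unet' because the UNet was never compiled.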
Aha, no, I don't have a compiled unet, but I did notice I was running out of space! Thank you -- I will look into that.
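Since disk exhaustion can make the export fail partway through without an obvious error, a pre-flight check like this can help. The 25 GB threshold is a guess on my part, not a documented requirement:

```python
import shutil


def check_free_space(path=".", required_gb=25.0) -> bool:
    """Warn if free disk space at `path` is below `required_gb`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < required_gb:
        print(f"Only {free_gb:.1f} GB free; SDXL compilation may fail partway.")
        return False
    return True
```

Running check_free_space() before optimum-cli export neuron would flag the problem before an hour of compilation is wasted.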
I am following the instructions here for compiling SDXL Turbo: https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion#stable-diffusion-xl-turbo, i.e. running the optimum-cli export neuron command shown above.
I'm running on inf2.8xl, with optimum==1.15.0 optimum-neuron==0.0.13
I get the exception shown above. neuron-top shows that I have two available Neuron cores with 0% utilisation.