I quantized the model to int8, and generation with the quantized checkpoint then failed with this error:
```
ubuntu@ip-172-31-19-240:~/gpt-fast$ python quantize.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --mode int8
Loading model ...
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Writing quantized weights to checkpoints/openlm-research/open_llama_7b/model_int8.pth
Quantization complete took 24.35 seconds
```
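For context, the scheme named in that log ("int8 weight-only symmetric per-channel quantization") can be sketched in a few lines. This is my own illustration, not gpt-fast's actual quantize.py, and the function name is made up: each output channel (weight row) gets one scale derived symmetrically from its max absolute value.

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-channel int8 weight quantization (illustrative sketch)."""
    # One scale per output channel, symmetric about zero: scale = max|w| / 127.
    max_abs = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scales = max_abs / 127.0
    # Round to the nearest int8 value and clamp to the symmetric range.
    q = torch.round(w / scales).clamp(-127, 127).to(torch.int8)
    return q, scales.squeeze(1)

# At inference, the weights are dequantized (or the scaling is fused into
# the matmul): w_approx = q.float() * scales.unsqueeze(1)
```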
```
ubuntu@ip-172-31-19-240:~/gpt-fast$ python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model_int8.pth
Traceback (most recent call last):
  File "/home/ubuntu/gpt-fast/generate.py", line 18, in <module>
    torch._inductor.config.fx_graph_cache = True # Experimental feature to reduce compilation times, will be on by default in future
  File "/opt/conda/lib/python3.10/site-packages/torch/_dynamo/config_utils.py", line 72, in __setattr__
    raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._inductor.config.fx_graph_cache does not exist
```
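From the traceback, the installed PyTorch's inductor config simply does not expose an `fx_graph_cache` knob, which suggests the build predates that experimental flag (I believe gpt-fast targets recent PyTorch nightlies). As a quick local workaround I could guard the assignment in generate.py, something like this sketch (assuming `hasattr` behaves normally on the config module):

```python
# Sketch of a local patch idea, not an official fix: only set the flag
# when the installed PyTorch actually exposes it.
import torch._inductor.config as inductor_config

if hasattr(inductor_config, "fx_graph_cache"):
    # Experimental feature to reduce compilation times on newer builds.
    inductor_config.fx_graph_cache = True
else:
    print("fx_graph_cache not available in this PyTorch build; skipping")
```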
System:
```
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.15.0-1049-aws x86_64)
Please note that Amazon EC2 P2 Instance is not supported on current DLAMI.
```