huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

Add additional parameter for model quantization + other features in NeuronXXXModel #299

Open samir-souza opened 10 months ago

samir-souza commented 10 months ago

NeuronXXXModel classes (e.g. NeuronDecoderModel in optimum/neuron/modeling_decoder.py) invoke transformers-neuronx to compile the target model. However, these classes do not pass through all of the input parameters that transformers-neuronx supports, which blocks key features such as model quantization.

Ask: accept all supported parameters, such as "neuron_config" and "context_length_estimate" (among others).

For instance, when using transformers-neuronx directly, we need to pass neuron_config to quantize the model to int8:

import os

from transformers_neuronx.config import NeuronConfig, QuantizationConfig
from transformers_neuronx.llama.model import LlamaForSampling

# Store weights as int8, dequantize to bfloat16 at compute time
neuron_config = NeuronConfig(
    quant=QuantizationConfig(quant_dtype="s8", dequant_dtype="bf16"),
)
kwargs = {
    "batch_size": batch_size,
    "amp": dtype,
    "tp_degree": tp_degree,
    "n_positions": n_positions,
    "unroll": None,
    "neuron_config": neuron_config,
    "context_length_estimate": [32, 128, 256, 400, 512, 850],
}
model = LlamaForSampling.from_pretrained(os.path.join(model_dir, "llama2-split"), **kwargs)
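In other words, the ask is for NeuronDecoderModel to forward these extra kwargs to transformers-neuronx instead of dropping them. A minimal, hypothetical sketch of that forwarding logic (the helper name and the stand-in `fake_from_pretrained` function below are illustrative, not part of optimum-neuron) could split unrecognized kwargs from supported ones using the target's signature:

```python
import inspect

def split_supported_kwargs(target, **kwargs):
    """Split kwargs into those accepted by `target` and the rest.

    Hypothetical helper showing how a wrapper class could forward
    parameters such as `neuron_config` or `context_length_estimate`
    to the underlying transformers-neuronx loader.
    """
    params = inspect.signature(target).parameters
    # If the target already takes **kwargs, everything can be forwarded.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    supported = {k: v for k, v in kwargs.items() if k in params}
    rejected = {k: v for k, v in kwargs.items() if k not in params}
    return supported, rejected

# Stand-in for a loader like LlamaForSampling.from_pretrained
def fake_from_pretrained(path, batch_size=1, neuron_config=None):
    return {"path": path, "batch_size": batch_size, "neuron_config": neuron_config}

supported, rejected = split_supported_kwargs(
    fake_from_pretrained, batch_size=4, neuron_config="cfg", bogus=True
)
# supported -> {"batch_size": 4, "neuron_config": "cfg"}
# rejected  -> {"bogus": True}
```

A wrapper could then forward `supported` and warn about (or raise on) `rejected`, rather than silently limiting the underlying API.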
HuggingFaceDocBuilderDev commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Thank you!


yahavb commented 1 month ago

We would like support for int8 weight storage, as described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide.html#int8-weight-storage-support, so that the model can run on smaller instances.
