vhiwase opened this issue 1 month ago
@vhiwase Apologies for the delay! Would you happen to know what dataset you might have been using? It's possible there are some weird out-of-bounds tokens causing errors.
@danielhanchen Apologies for the delay in responding. I'm currently testing the model with results obtained from OCR processing using Azure Document Intelligence. The inputs consist of random chunks of text extracted from various documents.
@vhiwase No worries! Does this happen on other machines? Like in a Colab?
@danielhanchen Apologies for the delay in responding. You are correct that we trained the model on Amazon EC2 G6 Instances, and inference works fine there. However, we hosted the model inference on a different machine, specifically Amazon EC2 G6e Instances. Could this be related to the dtype setting? (See the dtype check sketch after the note below.)
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+.
Note:
G6 Instances: Feature up to 8 NVIDIA L4 Tensor Core GPUs with 24 GB of memory per GPU, and third generation AMD EPYC processors.
G6e Instances: Feature up to 8 NVIDIA L40S Tensor Core GPUs with 384 GB of total GPU memory (48 GB per GPU), and third generation AMD EPYC processors.
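For reference, a minimal sketch (my own check, not part of the training script) of what the auto-detected dtype should resolve to on each machine:

```python
# Minimal sketch: check what dtype = None would resolve to on each machine.
# Both the L4 (G6) and the L40S (G6e) are newer than Ampere, so bfloat16
# should be selected on either instance type.
import torch

major, _ = torch.cuda.get_device_capability()
dtype = torch.bfloat16 if major >= 8 else torch.float16
print(torch.cuda.get_device_name(0), "->", dtype)

# The resolved dtype could also be passed explicitly (dtype = dtype) to
# FastLanguageModel.from_pretrained to rule out a mismatch between machines.
```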
I attempted to serve the original base model of Llama 3.1 in 4-bit, both with and without setting `load_in_4bit`. Below are my observations.

When `load_in_4bit = True`: the model throws the following error. However, this behavior does not occur immediately; it happens only after the model has processed some initial data. The model also consumes 8 GB of GPU memory.
Code:
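(Sketch only; the original snippet is not reproduced above, so the model name, sequence length, and prompt below are placeholders, assuming the standard Unsloth loading call.)

```python
# Minimal sketch with placeholder values: load the 4-bit base model and
# run inference on an OCR text chunk.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # placeholder 4-bit base model
    max_seq_length = 2048,                               # placeholder
    dtype = None,                                        # auto detection
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference path

inputs = tokenizer("<OCR text chunk here>", return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
```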
When `load_in_4bit = False`: the model runs without errors and uses around 16 GB of GPU memory.

Code:
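(Again a sketch only with placeholder values; the one intended difference from the run above is the `load_in_4bit` flag.)

```python
# Same placeholder loading call, with 4-bit quantization disabled; the model
# then loads in 16-bit and uses roughly 16 GB of GPU memory.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",  # placeholder 16-bit base model
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
```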
Based on these findings, it seems that if we trained with `load_in_4bit = True`, the same issue would persist in our fine-tuned model, as it is inherent to the 4-bit base model.

I recommend that we train this model again for the `load_in_4bit = True` case.