Training a reward model with LoRA gets weird logs when using SEQ_CLS

System Info

Platform: Linux-5.15.0-121-generic-x86_64-with-glibc2.35
Python version: 3.10.15
PyTorch version: 2.5.1
CUDA device(s): NVIDIA GeForce RTX 3060
Transformers version: 4.46.2
Accelerate version: 1.1.1
Accelerate config: not found
Datasets version: 3.1.0
HF Hub version: 0.26.2
TRL version: 0.12.0
bitsandbytes version: not installed
DeepSpeed version: not installed
Diffusers version: not installed
Liger-Kernel version: not installed
LLM-Blender version: not installed
OpenAI version: not installed
PEFT version: 0.13.2

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder
[X] My own task or dataset (give details below)

Reproduction

Hi, I'm using the example code from here, running it with the following params:

python reward_modeling.py \
    --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct \
    --dataset_name all_preference_pairs.jsonl \
    --output_dir Qwen2.5-Math-1.5B-Instruct-Reward-LoRA \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --num_train_epochs 3 \
    --gradient_checkpointing True \
    --learning_rate 1.0e-5 \
    --logging_steps 5 \
    --eval_strategy no \
    --eval_steps 50 \
    --max_length 4096 \
    --use_peft \
    --lora_r 32 \
    --lora_alpha 16

And I get reasonable logs like this:

{'loss': 0.7776, 'grad_norm': 5.775640487670898, 'learning_rate': 1.9948320413436695e-05, 'epoch': 0.01}
{'loss': 0.6384, 'grad_norm': 20.25526237487793, 'learning_rate': 1.7105943152454783e-05, 'epoch': 0.43}
{'loss': 0.4843, 'grad_norm': 27.412843704223633, 'learning_rate': 1.1111111111111113e-05, 'epoch': 1.33}
{'loss': 0.393, 'grad_norm': 34.913734436035156, 'learning_rate': 4.031007751937985e-06, 'epoch': 2.39}

However I also get this warning:

UserWarning: You are using a `task_type` that is different than `SEQ_CLS` for PEFT. This will lead to silent bugs Make sure to pass --lora_task_type SEQ_CLS when using this script with PEFT.

But when I add --lora_task_type SEQ_CLS the logs go haywire:

{'loss': 1.1914025258187767e+34, 'grad_norm': nan, 'learning_rate': 9.870801033591732e-06, 'epoch': 0.04}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.483204134366925e-06, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.354005167958657e-06, 'epoch': 0.19}
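To pin down the first step at which the run breaks, the log lines can be scanned for non-finite values. This is only a rough sketch: the helper name `non_finite_fields` and the regex are my own and not part of TRL or Transformers, and it assumes the log lines look exactly like the ones above.

```python
import math
import re

# Matches "'key': value" pairs in a Trainer log line, e.g. 'grad_norm': nan
PAIR = re.compile(r"'(\w+)':\s*([^,}\s]+)")

def non_finite_fields(log_line: str) -> list[str]:
    """Return the names of fields whose value is nan or inf."""
    bad = []
    for key, raw in PAIR.findall(log_line):
        value = float(raw)  # float() accepts 'nan', 'inf', and '1.19e+34'
        if not math.isfinite(value):
            bad.append(key)
    return bad

lines = [
    "{'loss': 1.1914025258187767e+34, 'grad_norm': nan, 'learning_rate': 9.870801033591732e-06, 'epoch': 0.04}",
    "{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 9.483204134366925e-06, 'epoch': 0.15}",
]
for line in lines:
    print(non_finite_fields(line))
```

Note that the first logged loss (about 1.19e+34) is huge but still finite, so only `grad_norm` is flagged; the gradient norm is already nan at the first logging step.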

According to the docs, the dataset should be:

The RewardTrainer requires a implicit prompt preference dataset. It means that the dataset should only contain the columns "chosen" and "rejected" (and not "prompt"). The RewardTrainer supports both conversational and standard dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

Here are a couple of lines from my dataset:

{"chosen": "<|im_start|>system\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nSuppose that a polynomial of the form $p(x)=x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ has no real roots. What is the maximum possible number of coefficients of -1 in $p$?<|im_end|>\n<|im_start|>assistant\nTo solve this problem, we need to understand the behavior of the polynomial $p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$. The key point is that if $p(x)$ has no real roots, then all the roots must be complex. Since the polynomial has real coefficients, the complex roots must come in conjugate pairs. This means that if there is a complex root, its conjugate must also be a root. The only real roots possible are $\\pm 1$.\n\nTo maximize the number of coefficients of $-1$, we need to consider the polynomial $p(x)$ evaluated at $x = -1$. If $p(-1) \\neq 0$, then $-1$ cannot be a root of the polynomial. We can use this property to determine the maximum number of coefficients of $-1$.\n\nLet's denote the polynomial as $p(x) = x^{2010} + a_{2009}x^{2009} + \\cdots + a_1x + 1$, where each $a_i$ is either $1$ or $-1$. If we substitute $x = -1$ into the polynomial, we get:\n\\[ p(-1) = (-1)^{2010} + a_{2009}(-1)^{2009} + \\cdots + a_1(-1) + 1 = 1 - a_{2009} + a_{2008} - \\cdots - a_1 + 1. \\]\nThis simplifies to:\n\\[ p(-1) = 2 - (a_{2009} + a_{2008} + \\cdots + a_1). \\]\nFor $p(-1) \\neq 0$, we need:\n\\[ 2 - (a_{2009} + a_{2008} + \\cdots + a_1) \\neq 0. \\]\nThis means that the sum of the coefficients $a_{2009} + a_{2008} + \\cdots + a_1$ must not be equal to 2. The maximum number of coefficients of $-1$ is achieved when the sum of the coefficients is minimized, but not equal to 2. 
The maximum number of $-1$ coefficients is 1005, because if we have 1006 or more $-1$ coefficients, the sum will be less than or equal to $-1006$, which is not possible since the sum must be an even number (because there are 2010 terms in total and 1005 of them are $-1$).\n\nTherefore, the maximum possible number of coefficients of $-1$ is 1005. Let's confirm this with Python code.\n```python\n# We need to confirm that with 1005 coefficients of -1, the sum of the coefficients is not equal to 2.\r\n# The polynomial is x^2010 + a_2009*x^2009 + ... + a_1*x + 1, where a_i is either 1 or -1.\r\n# If we have 1005 coefficients of -1, the sum of the coefficients is -1005 + (2010 - 1005) = 1005.\r\n# This means the polynomial evaluated at x = -1 is 2 - 1005 = -1003, which is not zero.\r\n\r\n# Let's calculate the sum of the coefficients with 1005 coefficients of -1\r\nnum_terms = 2010\r\ncoefficients_of_minus_one = 1005\r\nsum_of_coefficients = num_terms - 2 * coefficients_of_minus_one\r\n\r\n# Check if the sum of coefficients is not equal to 2\r\nis_not_zero = sum_of_coefficients != 2\r\n\r\nprint((sum_of_coefficients, is_not_zero))\n```\n```output\n(0, True)\n```\nThe sum of the coefficients with 1005 coefficients of $-1$ is $0$, which means the polynomial evaluated at $x = -1$ is $2 - 0 = 2$, but this is not a root of the polynomial because the polynomial has no real roots. Therefore, the maximum possible number of coefficients of $-1$ is indeed 1005.\n\nSo the final answer is:\n\\[\n\\boxed{1005}\n\\]", "rejected": "<|im_start|>system\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nSuppose that a polynomial of the form $p(x)=x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ has no real roots. 
What is the maximum possible number of coefficients of -1 in $p$?<|im_end|>\n<|im_start|>assistant\nTo determine the maximum possible number of coefficients of -1 in the polynomial \\(p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1\\) that has no real roots, we need to analyze the properties of such polynomials. Specifically, we need to consider the behavior of polynomials with alternating signs and determine the maximum number of -1 coefficients without the polynomial having any real roots.\n\nA key observation is that a polynomial with only even powers of \\(x\\) (i.e., \\(x^{2010}, x^{2008}, \\ldots, x^2, 1\\)) and all coefficients equal to 1 will always have no real roots. This is because the polynomial \\(P(x) = x^{2010} + x^{2008} + \\cdots + x^2 + 1\\) is always positive for all real \\(x\\) (since all terms are non-negative and at least one term is positive). Adding or subtracting an odd power term (like \\(x^{2009}\\) or \\(x^{2007}\\)) can potentially introduce real roots, but we need to ensure that no such roots are created.\n\nLet's consider the polynomial with the maximum number of -1 coefficients. If we have too many -1 coefficients, the polynomial could potentially have real roots. The key is to balance the polynomial in such a way that it remains without real roots.\n\nTo ensure the polynomial has no real roots, we need to balance the polynomial in a way that the sum of the coefficients of the polynomial is positive. This can be achieved by having the same number of positive and negative coefficients, or by having a slight majority of positive coefficients.\n\nThe maximum number of -1 coefficients can be determined by considering the polynomial with alternating signs, but ensuring that the sum of the coefficients is positive. 
This means we can have up to 1005 -1 coefficients and still have a positive sum of coefficients.\n\nLet's implement this in Python to verify the maximum possible number of -1 coefficients.\n\n```python\n# The maximum number of -1 coefficients in the polynomial p(x) = x^2010 \u00b1 x^2009 \u00b1 ... \u00b1 x \u00b1 1\n# such that the polynomial has no real roots can be determined by ensuring the sum of the coefficients is positive.\n# This means we can have up to 1005 -1 coefficients and still have a positive sum of coefficients.\n\n# Let's verify this with Python code.\ndef max_negatives_in_polynomial(n):\n    # The polynomial x^n \u00b1 x^(n-1) \u00b1 ... \u00b1 x \u00b1 1\n    # We need to balance the polynomial such that the sum of the coefficients is positive.\n    # This means we can have up to n//2 - 1 negative coefficients and still have a positive sum.\n    return n // 2 - 1\n\n# For the polynomial of degree 2010\nmax_negatives = max_negatives_in_polynomial(2010)\nprint(max_negatives)\n```\n```output\n1004\n```\nThe maximum possible number of coefficients of -1 in the polynomial \\(p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1\\) that has no real roots is \\(\\boxed{1004}\\)."}


{"chosen": "<|im_start|>system\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nSuppose that a polynomial of the form $p(x)=x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ has no real roots. What is the maximum possible number of coefficients of -1 in $p$?<|im_end|>\n<|im_start|>assistant\nTo solve the problem, we need to understand the behavior of the polynomial $p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$. The key observation here is that if $p(x)$ has no real roots, then all the roots of the polynomial must be complex. For a polynomial with real coefficients, complex roots come in conjugate pairs. Therefore, if there is a polynomial of the form $x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ with no real roots, then the polynomial must have an even degree, and all its roots must be complex.\n\nThe polynomial $x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ has 2011 terms, and we need to determine the maximum number of coefficients that can be $-1$ such that the polynomial has no real roots. The polynomial will have no real roots if and only if it is always positive or always negative for all real $x$. This means that the polynomial must be either a sum of squares or a difference of squares. Since the polynomial has an even degree and the leading coefficient is 1 (which is positive), the polynomial will be always positive if the sum of the absolute values of the coefficients is even. This means that the number of $-1$ coefficients must be even, and the number of $1$ coefficients must be even.\n\nGiven that the polynomial has 2011 terms in total, if we let the number of $-1$ coefficients be $k$, then the number of $1$ coefficients is $2011 - k$. For the polynomial to have no real roots, both $k$ and $2011 - k$ must be even. This means that $k$ must be odd. 
The maximum possible value for $k$ is 2010, because if $k = 2010$, then $2011 - k = 1$, which is odd. Therefore, the maximum possible number of coefficients of $-1$ in $p$ is 2010.\n\nLet's confirm this reasoning with Python and sympy:\n\n```python\n# We don't need to use sympy for this particular problem, but we can use it to verify the logic.\n\nimport sympy as sp\n\n# Define the polynomial and the variable\nx = sp.symbols('x')\nn = 2010\n\n# The polynomial is of the form x^n \u00b1 x^(n-1) \u00b1 ... \u00b1 x \u00b1 1\n# We need to find the maximum number of -1 coefficients such that the polynomial has no real roots\n\n# The polynomial will be always positive or always negative if the sum of the absolute values of the coefficients is even\n# This means the number of -1 coefficients must be even, and the number of 1 coefficients must be even\n\n# The total number of terms is n + 1 = 2011, so if we let the number of -1 coefficients be k, then the number of 1 coefficients is 2011 - k\n# For the polynomial to have no real roots, both k and 2011 - k must be even, which means k must be odd\n# The maximum possible value for k is 2010, because if k = 2010, then 2011 - k = 1, which is odd\n\n# Therefore, the maximum possible number of coefficients of -1 in p is 2010\nmax_neg_coefficients = 2010\n\nprint(max_neg_coefficients)\n```\n```output\n2010\n```\nThe maximum possible number of coefficients of $-1$ in the polynomial $p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ such that $p(x)$ has no real roots is \\(\\boxed{1005}\\).", "rejected": "<|im_start|>system\nPlease integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}.<|im_end|>\n<|im_start|>user\nSuppose that a polynomial of the form $p(x)=x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1$ has no real roots. 
What is the maximum possible number of coefficients of -1 in $p$?<|im_end|>\n<|im_start|>assistant\nTo solve this problem, we need to understand the nature of the polynomial \\(p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1\\). The polynomial has 2011 terms, including the constant term 1. If it has no real roots, then it must not change sign for any real value of \\(x\\). This means that all the coefficients must be such that the polynomial does not cross the x-axis.\n\nThe polynomial \\(p(x)\\) can be written as \\(p(x) = x^{2010} + a_{2009}x^{2009} + \\cdots + a_1x + 1\\), where each \\(a_i\\) is either 1 or -1. The polynomial \\(p(x)\\) is a sum of terms where each term is either \\(x^{2010}\\) or \\(-x^{2010}\\) for \\(x^{2009}\\), and so on, down to \\(-x\\) or \\(+x\\), and finally \\(1\\).\n\nThe key observation is that the polynomial \\(p(x)\\) will not have any real roots if the sum of the coefficients of the polynomial is non-zero. If the sum of the coefficients is zero, then the polynomial could potentially have a real root. However, if the sum of the coefficients is non-zero, then the polynomial will not have any real roots because it will not change sign.\n\nThe sum of the coefficients of the polynomial \\(p(x)\\) is \\(1 + a_{2009} + a_{2008} + \\cdots + a_1 + 1\\). Since each \\(a_i\\) is either 1 or -1, the sum of the coefficients will be an even number (because there are 2011 terms, and each term is either 1 or -1). Therefore, the sum of the coefficients will be non-zero if and only if there is an odd number of terms with coefficient -1.\n\nThe maximum number of coefficients that can be -1 while still maintaining a non-zero sum is 2010, because if all but one of the coefficients are -1, then the sum will be -2010 + 1 = -2009, which is non-zero. If 2011 of the coefficients are -1, then the sum will be -2011, which is also non-zero. 
However, if 2011 of the coefficients are -1, then the polynomial will have a real root, so this is not allowed.\n\nTherefore, the maximum possible number of coefficients of -1 in \\(p(x)\\) is 2009. Let's confirm this with Python code.\n```python\n# The maximum number of coefficients that can be -1 while still maintaining a non-zero sum\n# is 2010 because if all but one of the coefficients are -1, then the sum will be -2010 + 1 = -2009,\n# which is non-zero. If 2011 of the coefficients are -1, then the sum will be -2011, which is also non-zero,\n# but this would mean the polynomial has a real root, so this is not allowed.\n\n# Therefore, the maximum possible number of coefficients of -1 in p(x) is 2009.\nmax_neg_coefficients = 2010 - 1\nprint(max_neg_coefficients)\n```\n```output\n2009\n```\nThe maximum possible number of coefficients of -1 in the polynomial \\(p(x) = x^{2010} \\pm x^{2009} \\pm \\cdots \\pm x \\pm 1\\) that has no real roots is \\(\\boxed{2009}\\)."}
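To rule out the dataset as the cause, each record can be checked against the format described in the docs. The sketch below assumes the standard (non-conversational) format, where "chosen" and "rejected" are plain strings; `check_record` is a hypothetical helper written for this issue, not a TRL function.

```python
import json

REQUIRED = {"chosen", "rejected"}

def check_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL record (empty if none)."""
    record = json.loads(line)
    problems = []
    keys = set(record)
    if keys != REQUIRED:
        # Symmetric difference: columns that are missing or should not be there
        problems.append(f"unexpected columns: {sorted(keys ^ REQUIRED)}")
    for key in sorted(REQUIRED & keys):
        if not isinstance(record[key], str) or not record[key].strip():
            problems.append(f"{key!r} is empty or not a string")
    if record.get("chosen") == record.get("rejected"):
        problems.append("chosen and rejected are identical")
    return problems

sample = '{"chosen": "Q: 1+1?\\nA: 2", "rejected": "Q: 1+1?\\nA: 3"}'
print(check_record(sample))  # an empty list means no problems were found
```

Running this over every line of all_preference_pairs.jsonl (one `check_record` call per line) would show whether any record has an extra "prompt" column or an empty completion.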

Am I doing something wrong with the dataset, is the model incompatible with AutoModelForSequenceClassification, or is there a bug in TRL? Any feedback would be appreciated.

Expected behavior

I would expect the logs to show a reasonable loss throughout training with lora_task_type set to SEQ_CLS, or at least some feedback if I am using the wrong type of dataset. Getting an enormous loss followed by 0.0 losses and nan gradient norms usually means I did something wrong...
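For context on what "reasonable" means here: as far as I understand, the reward trainer optimizes a pairwise loss of the form -log(sigmoid(r_chosen - r_rejected)). The plain-Python sketch below is my own illustration of that formula, not the TRL implementation, and `pairwise_loss` is a made-up name.

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) is equal to -log(sigmoid(margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(round(pairwise_loss(0.0, 0.0), 4))  # zero margin gives log(2), about 0.6931
print(pairwise_loss(50.0, -50.0))         # very large positive margin: loss is almost zero
print(pairwise_loss(float("nan"), 0.0))   # a non-finite reward propagates to the loss
```

Under this formula, a freshly initialized head should start near 0.69, which matches the first run above. A logged loss of 0.0 would only be expected when the chosen reward exceeds the rejected one by a very large margin, which seems implausible this early in training.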

Checklist

[X] I have checked that my issue isn't already filed (see open issues)
[X] I have included my system information
[X] Any code provided is minimal, complete, and reproducible (more on MREs)
[X] Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
[X] Any traceback provided is complete
