microsoft/CodeBERT

Question about UniXcoder input size #248


Yangget commented 1 year ago

Hello, author.

Thanks for your open source contributions.

My task is to fine-tune the code generation part, but in my dataset the content that needs to be generated is relatively long, and the target length of 150 used in the paper is not enough. How can I set a larger output length, for example greater than 1024?

Yangget commented 1 year ago

There is another strange problem. I fine-tune with the following command:

python run.py \
    --do_train \
    --model_name_or_path microsoft/unixcoder-base \
    --train_filename dataset/train.json \
    --dev_filename dataset/dev.json \
    --output_dir saved_models \
    --max_source_length 350 \
    --max_target_length 150 \
    --beam_size 3 \
    --train_batch_size 32 \
    --eval_batch_size 32 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 30 

My data is formatted as follows:

{"code": "def construct ( self ) : annulus_1 = Annulus ( inner_radius = 0 . 5, outer_radius = 1 ) . shift ( UP ) ; annulus_2 = Annulus ( inner_radius = 0 . 3, outer_radius = 0 . 6, color = RED ) . next_to ( annulus_1, DOWN ) ; self . add ( annulus_1, annulus_2 ) ; self . wait ( 1 ) ;", "nl": "Declare two concentric circles, the first with an inner diameter of 0.5 and an outer diameter of 1, moving upwards by one unit. The second inner diameter is 0.3, the outer diameter is 0.6, and the color is red, following the first circle. Add these two circles"}

The dataset contains only 187 examples.
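
(To sanity-check whether my examples fit within --max_source_length 350 and --max_target_length 150, something like the rough check below could be run. This is only a sketch: it assumes the Hugging Face tokenizer for microsoft/unixcoder-base and the JSONL fields "nl"/"code" shown above, and it ignores the few special tokens run.py adds around each sequence.)

import json
from transformers import AutoTokenizer

# Rough length check against the --max_source_length / --max_target_length settings above.
# Assumes one JSON object per line, with "nl" as the input and "code" as the target.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")

nl_lens, code_lens = [], []
with open("dataset/train.json") as f:
    for line in f:
        js = json.loads(line)
        nl_lens.append(len(tokenizer.tokenize(js["nl"])))
        code_lens.append(len(tokenizer.tokenize(js["code"])))

print(f"examples: {len(nl_lens)}")
print(f"nl tokens:   max={max(nl_lens)}, over 350: {sum(l > 350 for l in nl_lens)}")
print(f"code tokens: max={max(code_lens)}, over 150: {sum(l > 150 for l in code_lens)}")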

This is the training loss:

04/12/2023 12:56:12 - INFO - __main__ -   ***** Running training *****
04/12/2023 12:56:12 - INFO - __main__ -     Num examples = 187
04/12/2023 12:56:12 - INFO - __main__ -     Batch size = 2
04/12/2023 12:56:12 - INFO - __main__ -     Num epoch = 30
/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
04/12/2023 12:56:41 - INFO - __main__ -   epoch 1 step 100 loss 1.5808
04/12/2023 12:57:05 - INFO - __main__ -   epoch 2 step 200 loss 0.8058
04/12/2023 12:57:28 - INFO - __main__ -   epoch 3 step 300 loss 0.5501
04/12/2023 12:57:52 - INFO - __main__ -   epoch 4 step 400 loss 0.3648
04/12/2023 12:58:16 - INFO - __main__ -   epoch 5 step 500 loss 0.2547
04/12/2023 12:58:39 - INFO - __main__ -   epoch 6 step 600 loss 0.1813
04/12/2023 12:59:03 - INFO - __main__ -   epoch 7 step 700 loss 0.1483
04/12/2023 12:59:27 - INFO - __main__ -   epoch 8 step 800 loss 0.0975
04/12/2023 12:59:50 - INFO - __main__ -   epoch 9 step 900 loss 0.077
04/12/2023 13:00:14 - INFO - __main__ -   epoch 10 step 1000 loss 0.0658
04/12/2023 13:00:37 - INFO - __main__ -   epoch 11 step 1100 loss 0.0535
04/12/2023 13:01:01 - INFO - __main__ -   epoch 12 step 1200 loss 0.0385
04/12/2023 13:01:24 - INFO - __main__ -   epoch 13 step 1300 loss 0.0259
04/12/2023 13:01:48 - INFO - __main__ -   epoch 14 step 1400 loss 0.0218
04/12/2023 13:02:12 - INFO - __main__ -   epoch 15 step 1500 loss 0.0175
04/12/2023 13:02:36 - INFO - __main__ -   epoch 17 step 1600 loss 0.014
04/12/2023 13:02:59 - INFO - __main__ -   epoch 18 step 1700 loss 0.0142
04/12/2023 13:03:23 - INFO - __main__ -   epoch 19 step 1800 loss 0.0123
04/12/2023 13:03:47 - INFO - __main__ -   epoch 20 step 1900 loss 0.0093
04/12/2023 13:04:10 - INFO - __main__ -   epoch 21 step 2000 loss 0.0106
04/12/2023 13:04:34 - INFO - __main__ -   epoch 22 step 2100 loss 0.0073
04/12/2023 13:04:57 - INFO - __main__ -   epoch 23 step 2200 loss 0.007
04/12/2023 13:05:21 - INFO - __main__ -   epoch 24 step 2300 loss 0.007
04/12/2023 13:05:45 - INFO - __main__ -   epoch 25 step 2400 loss 0.0073
04/12/2023 13:06:08 - INFO - __main__ -   epoch 26 step 2500 loss 0.0063
04/12/2023 13:06:32 - INFO - __main__ -   epoch 27 step 2600 loss 0.0048
04/12/2023 13:06:56 - INFO - __main__ -   epoch 28 step 2700 loss 0.0037
04/12/2023 13:07:19 - INFO - __main__ -   epoch 29 step 2800 loss 0.0044

When I run prediction, it produces garbled results:

def construct ( self ) :
    # Declare two concentric circles, the first with an inner diameter of 0.5 and an outer diameter of 1, moving upwards by one unit. The second inner diameter is 0.3, the outer diameter is 0.6, and the color is red, following the first circle. Add these two circles
filterfilterfilterfilterfilterfilterfilterfilterfilterfiltersubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribesubscribe

Do you know what could be causing this? Can you give me some advice? I would appreciate it!

guoday commented 1 year ago

1. To increase the maximum length of generation, set max_target_length to 512 or higher.
2. UniXcoder overfits on the dataset due to the limited amount of training data.
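
For reference, one possible way to apply point 1 to the command posted above (a sketch only, not an official recommendation: the batch size of 8 is an arbitrary example, since a longer target length will increase memory use):

python run.py \
    --do_train \
    --model_name_or_path microsoft/unixcoder-base \
    --train_filename dataset/train.json \
    --dev_filename dataset/dev.json \
    --output_dir saved_models \
    --max_source_length 350 \
    --max_target_length 512 \
    --beam_size 3 \
    --train_batch_size 8 \
    --eval_batch_size 8 \
    --learning_rate 5e-5 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 30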