'tuple' object has no attribute 'loss'

skye95git commented 2 years ago

Hi, I want to run CodeT5-base on the code generation task. I ran the command: python run_exp.py --model_tag codet5_base --task concode --sub_task none

There is an error: 'tuple' object has no attribute 'loss'.

I tried changing outputs = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask) to outputs, _ = model(input_ids=source_ids, attention_mask=source_mask, labels=target_ids, decoder_attention_mask=target_mask)

That raises a different error: too many values to unpack (expected 2)

What should I do?

yuewang-sf commented 2 years ago

Hi @skye95git, I could not reproduce your issue. You can check whether the model is a T5ForConditionalGeneration object or directly print the outputs. Also, make sure you download the correct version of transformers (>= 4.6.1).
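
The two checks suggested above can be sketched as follows. This is a minimal sketch, not code from the repository: the helper names (version_tuple, meets_minimum, installed_transformers_version, extract_loss) are hypothetical, and it assumes that older transformers releases return a plain tuple whose first element is the loss when labels are passed.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '4.6.1' or '4.10.0.dev0' into a comparable tuple of ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="4.6.1"):
    """True if the installed version is at least the minimum named in the thread."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is absent."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


def extract_loss(outputs):
    """Read the loss from a model output object or from a legacy tuple.

    Assumption: when labels are passed, a tuple output carries the loss first.
    """
    if hasattr(outputs, "loss"):
        return outputs.loss
    if isinstance(outputs, tuple):
        return outputs[0]
    raise TypeError(f"unexpected output type: {type(outputs).__name__}")
```

Printing type(outputs) next to the result of installed_transformers_version() should show quickly whether the environment is the cause; upgrading transformers is still the cleaner fix than indexing into the tuple.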

skye95git commented 2 years ago

Thanks for your reply! After I updated transformers, it worked. I want to pre-train a model on my own data. Do you plan to share the pre-training code?

skye95git commented 2 years ago

Hi, I want to try out CodeT5's code generation. How do I use the model to generate code after fine-tuning?

yuewang-sf commented 2 years ago

We currently do not have a plan to release the pre-training code, which should not be difficult to implement based on the paper. We are also happy to take questions regarding its implementation.

yuewang-sf commented 2 years ago

To generate code with the fine-tuned model, you can refer to the following function: https://github.com/salesforce/CodeT5/blob/100c7e55c1e120600a2e114e828a157d265a2411/run_gen.py#L84
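
For a single input outside the evaluation loop, generation might look roughly like the sketch below. It is not the function from run_gen.py: load_codet5 and generate_code are hypothetical helpers, the default lengths of 320 and 150 are guessed from the src320_trg150 part of the saved-model directory name, and the beam size of 10 is an arbitrary choice.

```python
def load_codet5(checkpoint="Salesforce/codet5-base"):
    """Load the tokenizer and model from the Hugging Face hub.

    The import is local so that the rest of this module loads even when
    transformers is not installed.
    """
    from transformers import RobertaTokenizer, T5ForConditionalGeneration

    tokenizer = RobertaTokenizer.from_pretrained(checkpoint)
    model = T5ForConditionalGeneration.from_pretrained(checkpoint)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def generate_code(nl, tokenizer, model, max_source_length=320,
                  max_target_length=150, beam_size=10):
    """Generate one code string for one natural-language input using beam search."""
    encoded = tokenizer(nl, return_tensors="pt", truncation=True,
                        max_length=max_source_length)
    generated = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        max_length=max_target_length,
        num_beams=beam_size,
        early_stopping=True,
    )
    # generate returns a batch of token id sequences; decode the first one
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

To use fine-tuned weights, the saved state dict would have to be loaded into the model before calling generate_code, for example with model.load_state_dict(torch.load(path)); this assumes the checkpoint was saved as a plain state dict.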

skye95git commented 2 years ago

Thanks for your reply! While fine-tuning the model, I ran into an error caused by a hard-coded absolute path.

There are many similar absolute paths in the repository, for example in models.py, calc_code_bleu.py, dataflow_match.py, and syntax_match.py.

It would be nice if the README mentioned that these paths need to be replaced.
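
One way to avoid editing such paths by hand is to resolve resources relative to the file that uses them. The sketch below is only an illustration: resolve_near is a hypothetical helper, and the resource name in the usage note is made up, since the actual paths are not shown in this thread.

```python
from pathlib import Path


def resolve_near(anchor_file, relative_path):
    """Build an absolute path to a resource that sits next to anchor_file.

    anchor_file is normally __file__ of the calling module, so the result
    does not depend on the current working directory or the machine.
    """
    return Path(anchor_file).resolve().parent / relative_path
```

Inside a module this would be called as resolve_near(__file__, "parser/some-library.so"), where the resource name is a placeholder.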

skye95git commented 2 years ago

Hi, I have finished fine-tuning. The result in sh/results is:

Are the results in sh/results evaluated on concode's test set or dev set? If they are evaluated on the dev set, how can I evaluate on the test set?

skye95git commented 2 years ago

I read the source code in run_gen.py and found that the result in sh/results is evaluated on concode's test set.

I want to see the predictions. Are the generated results for the test set stored in sh/saved_models/concode/codet5_base_all_lr10_bs32_src320_trg150_pat3_e30/prediction?

What do test_*.gold, test_*.output and test_*.src in the folder stand for respectively?

Is input data stored in test_*.src? Is output data stored in test_*.output?

skye95git commented 2 years ago

Hi, the paper says "we additionally collect two datasets of C/CSharp from BigQuery". How did you parse the C and C# code downloaded from BigQuery to extract functions? I also want to parse source code that I have collected and retrain the model. Could you share the parsing code?

yuewang-sf commented 2 years ago

Hi @skye95git, yes. Your understanding is correct. The test_*.src is the source input, test_*.output is the model output, and test_*.gold is the ground-truth target output.
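
To inspect the three files side by side, something like the following sketch could be used. It assumes each file holds one example per line and that the files are aligned by line number, which the thread does not confirm; load_predictions and exact_match are hypothetical helpers, and the exact-match score here compares whitespace-separated tokens, which may differ from how the repository scores predictions.

```python
from pathlib import Path


def load_predictions(src_path, output_path, gold_path):
    """Read the .src, .output and .gold files and align them line by line."""
    columns = [
        Path(path).read_text(encoding="utf-8").splitlines()
        for path in (src_path, output_path, gold_path)
    ]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("the three files have different numbers of lines")
    return [
        {"src": src, "output": output, "gold": gold}
        for src, output, gold in zip(*columns)
    ]


def exact_match(records):
    """Percentage of records whose output equals the gold target token for token."""
    if not records:
        return 0.0
    hits = sum(r["output"].split() == r["gold"].split() for r in records)
    return 100.0 * hits / len(records)
```

Printing a few records from load_predictions is usually enough to confirm which file holds the input and which holds the model output.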

yuewang-sf commented 2 years ago

We parsed it with tree-sitter, in a similar way to the CodeSearchNet dataset. We will release this additional data (C/C#) soon.

skye95git commented 2 years ago

That's cool. In addition to the data (C/C#) you will release, I want to parse C and C# source code that we obtained ourselves. Could you share the parsing code for C and C#?

There is a fork of the awesome function_parser library from GitHub's CodeSearchNet Challenge repo. Currently, it supports 6 languages: Python, Java, Go, PHP, Ruby, and JavaScript. It does not support C or C#. I tried using tree-sitter, as in the CodeSearchNet dataset, to parse C and C#. Unfortunately, the results were not satisfactory.

I would be grateful if you could share the C and C# parsing code. I plan to fork function_parser and update it, which could help more people.

skye95git commented 2 years ago

Hi, what is the difference between concode_field_sep and concode_elem_sep in the NL field of the Concode dataset? Which one marks a variable, and which one marks a function?

The description of CONCODE in CodeXGLUE says: nl combines natural language description and class environment. Elements in class environment are separated by special tokens like con_elem_sep and con_func_sep.

If concode_elem_sep corresponds to con_elem_sep, it marks a variable. But the content of dev.json seems to disagree. Line 269 of dev.json:

{
    "code": "int function ( double [ ] arg0 , double [ ] arg1 ) { int loc0 = arg0 . length - arg1 . length ; outer : for ( int loc1 = 0 ; loc1 <= loc0 ; loc1 ++ ) { for ( int loc2 = 0 ; loc2 < arg1 . length ; loc2 ++ ) { if ( ne ( arg0 [ loc1 + loc2 ] , arg1 [ loc2 ] ) ) { continue outer ; } } return ( loc1 ) ; } return ( - 1 ) ; }",
    "nl": "searches for the first subsequence of a that matches sub elementwise . elements of sub are considered to match elements of a if they pass the #eq test . concode_field_sep double max_ratio concode_elem_sep double min_ratio concode_elem_sep boolean off concode_field_sep boolean isElemMatch concode_elem_sep int compare concode_elem_sep boolean isSubset concode_elem_sep boolean ne concode_elem_sep boolean lt concode_elem_sep boolean gte concode_elem_sep void set_rel_diff concode_elem_sep boolean eq concode_elem_sep boolean lte concode_elem_sep boolean gt"
}

The env in nl is

concode_field_sep double max_ratio 
concode_elem_sep double min_ratio 
concode_elem_sep boolean off 
concode_field_sep boolean isElemMatch 
concode_elem_sep int compare 
concode_elem_sep boolean isSubset 
concode_elem_sep boolean ne 
concode_elem_sep boolean lt 
concode_elem_sep boolean gte 
concode_elem_sep void set_rel_diff 
concode_elem_sep boolean eq 
concode_elem_sep boolean lte 
concode_elem_sep boolean gt

The code is

int function(double[] arg0, double[] arg1) {
    int loc0 = arg0.length - arg1.length;
    outer: for (int loc1 = 0; loc1 <= loc0; loc1++) {
        for (int loc2 = 0; loc2 < arg1.length; loc2++) {
            if (ne(arg0[loc1 + loc2], arg1[loc2])) {
                continue outer;
            }
        }
        return (loc1);
    }
    return (-1);
}

The ne() in the code field is a function, not a variable, but it is preceded by concode_elem_sep in the env. So I'm a little confused.
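
The env listing above can be reproduced programmatically, which makes it easier to compare many examples. The sketch below only splits the nl string on the two separator tokens and records which token precedes each element; it does not decide which token means "variable" or "function". The name split_nl is hypothetical.

```python
import re

SEPARATORS = ("concode_field_sep", "concode_elem_sep")
# The capturing group makes re.split keep the separator tokens in its result.
_SPLIT = re.compile(r"\s*\b(" + "|".join(SEPARATORS) + r")\b\s*")


def split_nl(nl):
    """Split a Concode nl string into its description and tagged env elements.

    Returns (description, entries), where entries is a list of
    (separator_token, element_text) pairs in their original order.
    """
    pieces = _SPLIT.split(nl)
    description = pieces[0].strip()
    entries = [
        (separator, element.strip())
        for separator, element in zip(pieces[1::2], pieces[2::2])
    ]
    return description, entries
```

Grouping the entries by separator over a larger sample of dev.json would show whether members after each token are consistently fields or methods.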

skye95git commented 2 years ago

Hi, I am trying to implement the pre-training code and have a couple of questions about the pre-training data:

The paper says you use CodeSearchNet as pre-training data. Do I need to preprocess CodeSearchNet before pre-training? If so, how should it be preprocessed?
The statistics in Table 1 of the CodeT5 paper differ somewhat from those in Table 1 of the CodeSearchNet paper. CodeT5 uses less pre-training data than the original CodeSearchNet data. Did you clean the data before pre-training?

skye95git commented 2 years ago

The dataset used to fine-tune for the code generation task is concode, which contains only Java. So can CodeT5 only generate Java code, or can it generate code in all eight languages used for pre-training? If so, does that mean CodeT5 can generate code directly, without fine-tuning?

salesforce / CodeT5

'tuple' object has no attribute 'loss' #5