microsoft / CodeXGLUE


How to pre-train CodeGPT and CodeGPT-adapted models for code generation? #75

Closed skye95git closed 2 years ago

skye95git commented 3 years ago

Hi, I want to re-pretrain the CodeGPT and CodeGPT-adapted models for code generation. How were these models pre-trained? Do you plan to share the pre-training code?

skye95git commented 3 years ago

I find that CodeGPT can be used for both Code Completion and Code Generation. Is there any difference between the CodeGPT used in the two tasks? Is it a single pre-trained model that can be used for both tasks?

skye95git commented 3 years ago

When I try to fine-tune CodeGPT on the code generation task, there is an error:

(screenshot of the error)

I have tried several times and met the same error. What should I do? Is it a data download problem?

celbree commented 3 years ago

Hi @skye95git , CodeGPT and CodeGPT-adapted are both pre-trained with a causal language modeling objective, as in GPT-2. Since this is also the training objective of the code completion task, you can just use the code completion code to re-pretrain or continue pre-training the model. It can be used for both the code completion and code generation tasks because both are autoregressive generation tasks. About the error, I think it might be due to the download speed. It's a timeout error according to the first line of the log. You can try downloading the model on a faster network.
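
A minimal sketch of this causal language-modeling objective, using the Hugging Face transformers API directly rather than the repo's run_lm.py (the model name and code string below are just examples):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("microsoft/CodeGPT-small-java-adaptedGPT2")
model = GPT2LMHeadModel.from_pretrained("microsoft/CodeGPT-small-java-adaptedGPT2")

code = "public int add(int a, int b) { return a + b; }"
inputs = tokenizer(code, return_tensors="pt")
# causal LM: the labels are the input ids themselves; the model shifts them internally,
# so the same loss drives pre-training, continued pre-training, and code completion
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # an optimizer step would follow in a real training loop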

skye95git commented 3 years ago

Hi @skye95git , CodeGPT and CodeGPT-adapted are both pre-trained with a causal language modeling objective, as in GPT-2. Since this is also the training objective of the code completion task, you can just use the code completion code to re-pretrain or continue pre-training the model. It can be used for both the code completion and code generation tasks because both are autoregressive generation tasks. About the error, I think it might be due to the download speed. It's a timeout error according to the first line of the log. You can try downloading the model on a faster network.

Thanks for your reply! Yes, the error was due to the download speed. I have finished fine-tuning and evaluating CodeGPT on the code generation task: (screenshot)

But the inference result is different from the result in the README. Why is the result zero? Did I do something wrong? (screenshot)

celbree commented 3 years ago

We don't provide the ground truth for the test set; that's why the result is zero. You can generate your predictions and submit them to codexglue@microsoft.com, and we will send your results back.

skye95git commented 2 years ago

Hi, I calculate the CodeBLEU score for the code generation task with the script in CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/. I use the CONCODE test set and run

python calc_code_bleu.py --refs /CodeXGLUE/Text-Code/text-to-code/evaluator/test.json --hyp /CodeXGLUE/Text-Code/text-to-code/evaluator/test.txt --lang java --params 0.25,0.25,0.25,0.25

The test.json is the CONCODE test set, and the test.txt is the corresponding predicted result. There is an error: ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?

I have another question: Can I pass the Concode test set directly into the --refs parameter?

parser.add_argument('--refs', type=str, nargs='+', required=True,
                        help='reference files')
parser.add_argument('--hyp', type=str, required=True, 
                        help='hypothesis file')
parser.add_argument('--lang', type=str, required=True, 
                        choices=['java','js','c_sharp','php','go','python','ruby'],
                        help='programming language')
parser.add_argument('--params', type=str, default='0.25,0.25,0.25,0.25',
                        help='alpha, beta and gamma')
skye95git commented 2 years ago

Hi, when I try to fine-tune CodeGPT with the pre-trained model CodeGPT-small-java as follows:

LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode_CodeGPT
PRETRAINDIR=microsoft/CodeGPT-small-java
LOGFILE=text2code_concode_CodeGPT.log
PER_NODE_GPU=8

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=512 \
        --do_train \
        --node_index 0 \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=5e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=6 \
        --per_gpu_eval_batch_size=12 \
        --gradient_accumulation_steps=2 \
        --num_train_epochs=30 \
        --logging_steps=100 \
        --save_steps=5000 \
        --overwrite_output_dir \
        --seed=42

I've tried it a few times. There is an error:

10/09/2021 16:41:09 - INFO - __main__ -   [0, 36667, 12023, 1, 2]
10/09/2021 16:41:11 - INFO - filelock -   Lock 139879773943904 acquired on /home/linjiayi/.cache/huggingface/transformers/33595bb220c9f28a0b5f118f74b92e9452ea8b2d57f95ff63ead768fd6d78fe7.370b83843c894ed8a095a0d4746bed76f0357559edebf4023f38652b971ca917.lock
10/09/2021 16:41:21 - INFO - filelock -   Lock 139879773943904 released on /home/linjiayi/.cache/huggingface/transformers/33595bb220c9f28a0b5f118f74b92e9452ea8b2d57f95ff63ead768fd6d78fe7.370b83843c894ed8a095a0d4746bed76f0357559edebf4023f38652b971ca917.lock
HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/util/connection.py", line 73, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 353, in connect
    conn = self._new_conn()
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1257, in from_pretrained
    user_agent=user_agent,
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1371, in cached_path
    local_files_only=local_files_only,
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
    http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1473, in http_get
    r = requests.get(url, stream=True, proxies=proxies, headers=headers)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run.py", line 653, in <module>
    main()
  File "run.py", line 615, in main
    model = model_class.from_pretrained(pretrained)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load weights for 'microsoft/CodeGPT-small-java'. Make sure that:

- 'microsoft/CodeGPT-small-java' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'microsoft/CodeGPT-small-java' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 152570) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/7/error.json
local_rank: 2, node_index: 0, gpu_per_node: 8
local_rank: 1, node_index: 0, gpu_per_node: 8
local_rank: 5, node_index: 0, gpu_per_node: 8
local_rank: 7, node_index: 0, gpu_per_node: 8
local_rank: 0, node_index: 0, gpu_per_node: 8
local_rank: 6, node_index: 0, gpu_per_node: 8
local_rank: 3, node_index: 0, gpu_per_node: 8
local_rank: 4, node_index: 0, gpu_per_node: 8

The task is stuck after the local_rank: 4, node_index: 0, gpu_per_node: 8 line. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? Actually, I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune it successfully. What should I do?

skye95git commented 2 years ago

I want to re-pretrain CodeGPT for the code generation task from scratch. The description in CodeXGLUE/Code-Code/CodeCompletion-token/README.md says:

We pre-train monolingual models respectively on the Python and Java corpora from the CodeSearchNet dataset, which includes 1.1M Python functions and 1.6M Java methods. A function or method in the training dataset consists of a function signature and a function body. Some functions also contain NL docstrings.

I have a couple of questions about the pre-training data:

  1. Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
  2. What is the pre-training input data format? Can you give me an example?
  3. Can the training code in run_lm.py be used for pre-training?
celbree commented 2 years ago

The test.json is the CONCODE test set, and the test.txt is the corresponding predicted result. There is an error: ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?

You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.
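
A rough sketch of what such a rebuild boils down to with py-tree-sitter (assuming the grammar repositories, e.g. tree-sitter-java, are already cloned next to the script; the paths are placeholders):

from tree_sitter import Language

# build the shared library that the CodeBLEU scripts load for syntax/data-flow matching
Language.build_library(
    "my-languages.so",        # output shared library (referred to as my-languages.so above)
    ["tree-sitter-java"],     # one entry per cloned grammar repository
)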

I have another question: Can I pass the Concode test set directly into the --refs parameter?

I don't think so. You need to feed a .txt file for refs.

The task is stuck after the local_rank: 4, node_index: 0, gpu_per_node: 8 line. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? Actually, I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune it successfully. What should I do?

I think it is a Hugging Face transformers issue. But you can just use CodeGPT-small-java-adaptedGPT2, which performs better than CodeGPT-small-java.
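
If the download from the hub keeps failing, one possible workaround (a sketch with placeholder paths) is to fetch the model files manually and point from_pretrained, i.e. PRETRAINDIR, at a local directory:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# assumption: config.json, vocab.json, merges.txt, pytorch_model.bin, ... from
# https://huggingface.co/microsoft/CodeGPT-small-java were downloaded by hand
local_dir = "/path/to/CodeGPT-small-java"
tokenizer = GPT2Tokenizer.from_pretrained(local_dir)
model = GPT2LMHeadModel.from_pretrained(local_dir)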

  1. Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
  2. What is the pre-training input data format? Can you give me an example?
  3. Can the training code in run_lm.py be used for pre-training?
  1. We removed the NL comments in the source code and ignored line breaks and indentation when preprocessing.
  2. The data format is just the code itself, e.g. <s> def function(a=0): return a </s>
  3. Yes. We used run_lm.py for pre-training, since the code completion task shares the same training objective as causal language modeling. We used TextDataset for loading the pre-training corpus.
skye95git commented 2 years ago

The test.json is the CONCODE test set, and the test.txt is the corresponding predicted result. There is an error: ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?

You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.

I have another question: Can I pass the Concode test set directly into the --refs parameter?

I don't think so. You need to feed a .txt file for refs.

The task is stuck after the local_rank: 4, node_index: 0, gpu_per_node: 8 line. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? Actually, I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune it successfully. What should I do?

I think it is a Hugging Face transformers issue. But you can just use CodeGPT-small-java-adaptedGPT2, which performs better than CodeGPT-small-java.

  1. Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
  2. What is the pre-training input data format? Can you give me an example?
  3. Can the training code in run_lm.py be used for pre-training?
  1. We removed the NL comments in the source code and ignored line breaks and indentation when preprocessing.
  2. The data format is just the code itself, e.g. <s> def function(a=0): return a </s>
  3. Yes. We used run_lm.py for pre-training, since the code completion task shares the same training objective as causal language modeling. We used TextDataset for loading the pre-training corpus.

Thanks for your reply! I have a few questions:

  1. The example of the pre-training input data format you gave is as follows: <s> def function(a=0): return a </s>

The data format of the CodeSearchNet dataset is as follows:

 "original_string": "def get_vid_from_url(url):\n        \"\"\"Extracts video ID from URL.\n        \"\"\"\n        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n          parse_query_param(url, 'v') or \\\n          parse_query_param(parse_query_param(url, 'u'), 'v')"

According to the example you gave, is it necessary to remove the function name, as CONCODE does, during preprocessing?

  2. We pre-train monolingual models respectively on the Python and Java corpora from the CodeSearchNet dataset, which includes 1.1M Python functions and 1.6M Java methods. To fine-tune CodeGPT on the CONCODE dataset for text2code generation on multi-GPUs at a single machine.

The CONCODE dataset used for fine-tuning contains only a Java corpus. Can the pre-trained model CodeGPT-small-py-adaptedGPT2 generate both Python and Java code, or only Java code, after fine-tuning on the Java corpus?

  3. Can run_lm.py only be used to pre-train CodeGPT? Or can pre-training be implemented for both CodeGPT and CodeGPT-adapted?

  4. What should I be aware of if I re-pretrain CodeGPT-adapted? Do I need to pre-load GPT-2 as a starting point?

celbree commented 2 years ago
  1. No, sorry for giving a misleading example. We don't remove any identifiers.
  2. CodeGPT was pre-trained on monolingual data, so CodeGPT-small-py can only be used for Python data.
  3. It can be used for both. Just set args.pretrained_dir to gpt2 and you can continue pre-training a model initialized from the OpenAI GPT-2 checkpoint (see the sketch after this list).
  4. Yes. Set args.pretrained_dir to gpt2.
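
A minimal sketch of that initialization with transformers (whether run_lm.py registers extra special tokens is an assumption on my part, so treat the token handling as illustrative):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# start from the OpenAI GPT-2 checkpoint (the "adapted" setting)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# assumed: <s>, </s> and a pad token are added for the code corpus
tokenizer.add_special_tokens({"bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new tokens
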
skye95git commented 2 years ago
  1. No, sorry for giving a misleading example. We don't remove any identifiers.
  2. CodeGPT was pre-trained on monolingual data, so CodeGPT-small-py can only be used for Python data.
  3. It can be used for both. Just set args.pretrained_dir to gpt2 and you can continue pre-training a model initialized from the OpenAI GPT-2 checkpoint.
  4. Yes. Set args.pretrained_dir to gpt2.

Regarding the second answer: the dataset used to fine-tune and evaluate the code generation model is CONCODE, which contains only a Java corpus. So can CodeGPT-small-py generate Python code without fine-tuning on the CONCODE dataset? How do I evaluate the performance of CodeGPT-small-py?

skye95git commented 2 years ago

We removed the NL comments in the source code and ignored line breaks and indentation when preprocessing. The data format is just the code itself.

Do single-line comments, multi-line comments, and documentation comments all need to be removed from the source code?

Why do you remove the NL comments from the source code?

Does ignoring line breaks and indentation mean removing them? If so, why do you remove them?

celbree commented 2 years ago

We remove all the comments in the source code because we want to pre-train a model focused on the code domain. As for line breaks and indentation, we remove them to make the input sequence shorter. We know they may convey useful information, but in our experiments we found that we don't need to preserve them on purpose, as the pre-trained model can easily learn them from the code.
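
A rough sketch of that preprocessing for a Python function (my own approximation, not the authors' script; the regexes are deliberately crude and also collapse whitespace inside string literals):

import re

def preprocess(code: str) -> str:
    # drop docstrings and '#' comments, then collapse line breaks and indentation
    code = re.sub(r'"""(?:.|\n)*?"""', "", code)
    code = re.sub(r"#[^\n]*", "", code)
    return "<s> " + " ".join(code.split()) + " </s>"

# e.g. the CodeSearchNet sample quoted earlier becomes roughly
# "<s> def get_vid_from_url(url): return match1(url, r'youtu\.be/([^?/]+)') or ... </s>"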

skye95git commented 2 years ago

The test.json is the CONCODE test set, and the test.txt is the corresponding predicted result. There is an error: ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?

You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.

I have another question: Can I pass the Concode test set directly into the --refs parameter?

I don't think so. You need to feed a .txt file for refs.

The task is stuck after the local_rank: 4, node_index: 0, gpu_per_node: 8 line. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? Actually, I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune it successfully. What should I do?

I think it is a Hugging Face transformers issue. But you can just use CodeGPT-small-java-adaptedGPT2, which performs better than CodeGPT-small-java.

  1. Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
  2. What is the pre-training input data format? Can you give me an example?
  3. Can the training code in run_lm.py be used for pre-training?
  1. We removed the NL comments in the source code and ignored line breaks and indentation when preprocessing.
  2. The data format is just the code itself, e.g. <s> def function(a=0): return a </s>
  3. Yes. We used run_lm.py for pre-training, since the code completion task shares the same training objective as causal language modeling. We used TextDataset for loading the pre-training corpus.

You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.

After re-building my-languages.so by running build.sh, I run:

python calc_code_bleu.py --refs /home/linjiayi/CodeXGLUE/Text-Code/text-to-code/evaluator/test.json --hyp /home/linjiayi/CodeXGLUE/Text-Code/text-to-code/evaluator/test.txt --lang java --params 0.25,0.25,0.25,0.25

The result is

WARNING: There is no reference data-flows extracted from the whole corpus, and the data-flow match score degenerates to 0. Please consider ignoring this score.
ngram match: 0.00542230945681549, weighted ngram match: 0.07657225779288454, syntax_match: 0.0, dataflow_match: 0
CodeBLEU score:  0.020498641812425007
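
For reference, the final number is just the weighted sum of the four components using the --params weights, which matches the printed score:

# CodeBLEU = alpha*ngram + beta*weighted_ngram + gamma*syntax + delta*dataflow
alpha = beta = gamma = delta = 0.25
score = (alpha * 0.00542230945681549
         + beta * 0.07657225779288454
         + gamma * 0.0
         + delta * 0.0)
print(score)  # 0.020498641812425007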

I set the --refs parameter to the test.json of CONCODE, and the --hyp parameter to CodeGPT's generated results on the test set. The above results appear to be abnormal, probably because my input parameters are wrong.

I don't think so. You need to feed a .txt file for refs.

Should the --refs parameter be set to a .txt file? What should the content of the .txt file be? Is it a reference value? What does refs mean? There is only one code snippet for each NL in the CONCODE dataset:

{"code": "void function ( ScriptOrFnNode arg0 ) { int loc0 = - 1 ; collectFuncNodes ( arg0 , loc0 , null ) ; }", "nl": "generate mappings for each function node and parameters and variables names associated with it . concode_field_sep int parentScope concode_elem_sep ArrayList functionBracePositions concode_elem_sep ObjArray funcObjects concode_elem_sep int functionNum concode_elem_sep ArrayList functionVarMappings concode_elem_sep int lastTokenCount concode_elem_sep ArrayList replacedTokens concode_field_sep boolean isInScopeChain concode_elem_sep void reset concode_elem_sep void leaveNestingLevel concode_elem_sep String getMappedToken concode_elem_sep String getPreviousTokenMapping concode_elem_sep void collectFuncNodes concode_elem_sep int sourceCompress concode_elem_sep void enterNestingLevel"}

Where does the reference code list come from? Or is the content of the .txt file the code corresponding to each NL in the CONCODE test set?

I have solved it. The --refs should be set to a file that holds the test set's code field.

(screenshot)
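
For anyone else hitting this, a small sketch of extracting that code field into a refs file (assuming the CONCODE test.json is in the JSON-lines format quoted above; the file names are placeholders):

import json

# write one reference program per line, aligned with the lines of the prediction file
with open("test.json", encoding="utf-8") as fin, open("test_refs.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(json.loads(line)["code"] + "\n")

The resulting test_refs.txt can then be passed to --refs, with the model's generated test.txt as --hyp.
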
skye95git commented 2 years ago

We remove all the comments in the source code because we want to pre-train a model focused on the code domain. As for line breaks and indentation, we remove them to make the input sequence shorter. We know they may convey useful information, but in our experiments we found that we don't need to preserve them on purpose, as the pre-trained model can easily learn them from the code.

Thanks for your reply! Therefore, before pre-training, I need to preprocess the following code field of the CodeSearchNet data:

"def get_vid_from_url(url):\n        \"\"\"Extracts video ID from URL.\n        \"\"\"\n        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n          parse_query_param(url, 'v') or \\\n          parse_query_param(parse_query_param(url, 'u'), 'v')"

According to the rules above (remove NL comments, line breaks, and indentation, and wrap the code with <s> and </s>), the result after preprocessing is as follows:

"<s> def get_vid_from_url(url):  return match1(url, r'youtu\\.be/([^?/]+)') or match1(url, r'youtube\\.com/embed/([^/?]+)') or match1(url, r'youtube\\.com/v/([^/?]+)') or match1(url, r'youtube\\.com/watch/([^/?]+)') or parse_query_param(url, 'v') or parse_query_param(parse_query_param(url, 'u'), 'v') </s>"

I save the preprocessing result as pre_train.jsonl. Should the --data_dir parameter in run_lm.py be set to pre_train.jsonl?

celbree commented 2 years ago

The data_dir is set to the directory where pre_train.jsonl is located. And I think you need to modify the TextDataset to load data from the file as you wish.
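
A minimal sketch of such a modified dataset (assuming every line of pre_train.jsonl holds one pre-processed sample string; the class name and details are illustrative, not the repo's):

import torch
from torch.utils.data import Dataset

class PretrainDataset(Dataset):
    # assumption: each line is a raw "<s> ... </s>" string; use json.loads(line)
    # here instead if the lines are JSON objects
    def __init__(self, tokenizer, file_path, block_size=1024):
        self.examples = []
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                ids = tokenizer.encode(line.strip())[:block_size]
                if ids:
                    self.examples.append(ids)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)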

skye95git commented 2 years ago

The data_dir is set to the directory where pre_train.jsonl is located. And I think you need to modify the TextDataset to load data from the file as you wish.

Thanks for your reply! I will try it. I followed the README steps to reproduce Text2Code Generation. I fine-tuned with the following command:

LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode
PRETRAINDIR=microsoft/CodeGPT-small-java-adaptedGPT2
LOGFILE=text2code_concode.log
PER_NODE_GPU=8

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=512 \
        --do_train \
        --node_index 0 \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=5e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=6 \
        --per_gpu_eval_batch_size=12 \
        --gradient_accumulation_steps=2 \
        --num_train_epochs=30 \
        --logging_steps=100 \
        --save_steps=5000 \
        --overwrite_output_dir \
        --seed=42

After fine-tuning, I calculate BLEU and EM using the following command:

python evaluator/evaluator.py -a=evaluator/test.json -p=evaluator/test.txt

I calculate CodeBLEU with the following command:

python calc_code_bleu.py --refs /platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/CodeGPT-adapted/test.gold --hyp /platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/CodeGPT-adapted/test.output --lang java --params 0.25,0.25,0.25,0.25
| Model | EM | BLEU | CodeBLEU |
| -- | -- | -- | -- |
| CodeGPT (yours) | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted (yours) | 20.10 | 32.79 | 35.98 |
| CodeGPT-adapted (mine) | 19.8 | 28.62 | 32.58 |

The evaluation results are shown in the table above. The gap is 4.17 BLEU points and 3.4 CodeBLEU points. Was there something wrong with my fine-tuning?

celbree commented 2 years ago

We only evaluate perplexity on the dev set during training, but there's a gap between ppl and BLEU. So I suggest you choose the checkpoint that has the highest BLEU score on the dev set. It brought an improvement in our experiments.

skye95git commented 2 years ago

We only evaluate perplexity on the dev set during training, but there's a gap between ppl and BLEU. So I suggest you choose the checkpoint that has the highest BLEU score on the dev set. It brought an improvement in our experiments.

In fact, I chose the checkpoint that has the highest BLEU score on the test set.

On dev set:

| Model | EM | BLEU |
| -- | -- | -- |
| checkpoint-5000-1.9364 | 17.05 | 18.05 |
| checkpoint-10000-2.0459 | 16.45 | 21.98 |
| checkpoint-15000-2.1478 | 16.3 | 22.0 |
| checkpoint-20000-2.2717 | 15.8 | 23.25 |
| checkpoint-25000-2.3528 | 16.1 | 24.03 |
| checkpoint-30000-2.3921 | 16.2 | **24.47** |
| checkpoint-last | 16.2 | **24.47** |

On test set:

| Model | EM | BLEU | CodeBLEU |
| -- | -- | -- | -- |
| checkpoint-5000-1.9364 | 19.1 | 20.94 | 26.88 |
| checkpoint-10000-2.0459 | 20.15 | 24.67 | 29.74 |
| checkpoint-15000-2.1478 | 19.8 | 26.05 | 30.71 |
| checkpoint-20000-2.2717 | 19.7 | 27.39 | 31.79 |
| checkpoint-25000-2.3528 | 19.65 | 28.47 | 32.53 |
| checkpoint-30000-2.3921 | 19.8 | 28.62 | 32.59 |
| checkpoint-last | **19.8** | **28.62** | **32.60** |

Compared:

| Model | EM | BLEU | CodeBLEU |
| -- | -- | -- | -- |
| CodeGPT (yours) | 18.25 | 28.69 | 32.71 |
| CodeGPT-adapted (yours) | 20.10 | 32.79 | 35.98 |
| CodeGPT-adapted (mine) | 19.8 | 28.62 | 32.60 |

There is still a gap between my results and yours.

celbree commented 2 years ago

It seems that the BLEU score on the dev set is still increasing, which indicates that the training process hasn't overfit yet. Maybe continuing to train this model would bring further improvements.

skye95git commented 2 years ago

It seems that the BLEU score on the dev set is still increasing, which indicates that the training process hasn't overfit yet. Maybe continuing to train this model would bring further improvements.

Thanks. I will try it.

| Model | EM | BLEU | CodeBLEU |
| -- | -- | -- | -- |
| GPT-2 | 17.35 | 25.37 | 29.69 |
| CodeGPT | 18.25 | 28.69 | 32.71 |

CodeGPT shares the same model architecture and training objective as GPT-2, consisting of 12 layers of Transformer decoders.

I wonder what the difference is between GPT-2 and CodeGPT in the table. As described in the README, their model structures are the same. Why is the performance of the two models different when the training data is the same?

skye95git commented 2 years ago

The data_dir is set to the directory where pre_train.jsonl is in. And I think you need modify the TextDataset to load data from file as you wish.

I have modified the TextDataset. Then I run the following command to pre-train adaptedGPT2:

LANG=java                       
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/pretrain_adaptedGPT2
PRETRAINDIR=gpt2  
LOGFILE=CodeSearchNet_pretrain_adaptedGPT2_a100_8gpu.log
PER_NODE_GPU=8       

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=4e-4 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=8 \
        --per_gpu_eval_batch_size=8 \
        --gradient_accumulation_steps=4 \
        --num_train_epochs=20 \
        --logging_steps=100 \
        --save_steps=1000 \
        --seed=42 \
        --overwrite_output_dir

There is an error:

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/7/error.json
Traceback (most recent call last):
  File "run_lm.py", line 717, in <module>
    main()
  File "run_lm.py", line 620, in main
    torch.cuda.set_device(args.local_rank)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

(The same traceback is printed by each of the failing worker processes.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2097714) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
(torchelastic restarts the worker group three times, restart_count 1 through 3; every attempt fails in the same way, with each worker raising the identical "RuntimeError: CUDA error: invalid device ordinal" traceback from torch.cuda.set_device(args.local_rank) at run_lm.py line 620.)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2098190) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00029754638671875 seconds
{"name": "torchelastic.worker.status.TERMINATED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2098189", "role": "default", "hostname": "dgx078.scc.idea", "state": "TERMINATED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "2098190", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "2098191", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "2098192", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 4, "group_rank": 0, "worker_id": "2098193", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [4], \"role_rank\": [4], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 5, "group_rank": 0, "worker_id": "2098194", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [5], \"role_rank\": [5], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 6, "group_rank": 0, "worker_id": "2098195", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [6], \"role_rank\": [6], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 7, "group_rank": 0, "worker_id": "2098196", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [7], \"role_rank\": [7], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "dgx078.scc.idea", "state": "SUCCEEDED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 2098190 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
            run_lm.py FAILED           
=======================================
Root Cause:
[0]:
  time: 2021-10-18_09:48:38
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 2098190)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
  time: 2021-10-18_09:48:38
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 2098191)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2021-10-18_09:48:38
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 2098192)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2021-10-18_09:48:38
  rank: 4 (local_rank: 4)
  exitcode: 1 (pid: 2098193)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[4]:
  time: 2021-10-18_09:48:38
  rank: 5 (local_rank: 5)
  exitcode: 1 (pid: 2098194)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[5]:
  time: 2021-10-18_09:48:38
  rank: 6 (local_rank: 6)
  exitcode: 1 (pid: 2098195)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[6]:
  time: 2021-10-18_09:48:38
  rank: 7 (local_rank: 7)
  exitcode: 1 (pid: 2098196)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
***************************************

It appears to be a distributed training exception. What should I do?

skye95git commented 2 years ago

It seems that the BLEU score on the dev set is still increasing, which indicates that the training process hasn't overfit. Maybe continuing to train this model would bring further improvements.

After setting --num_train_epochs=60, I get the latest results:

Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
CodeGPT (yours) | 18.25 | 28.69 | 32.71
CodeGPT-adapted (yours) | 20.10 | 32.79 | 35.98
CodeGPT-adapted (mine) | **20.15** | **33.56** | **36.57**
skye95git commented 2 years ago
  1. No. Sorry for giving a misleading example. We don't remove any identifiers.
  2. CodeGPT was pre-trained on monolingual data, so CodeGPT-small-py can only be used for Python data.
  3. It can be used for both. By setting args.pretrained_dir to gpt2, you can continue pre-training a model initialized from the OpenAI GPT-2 checkpoint.
  4. Yes. Set args.pretrained_dir to gpt2.

If I want to pre-train adaptedGPT2, I should set args.pretrained_dir to gpt2. If I want to pre-train CodeGPT from scratch, what should I do? Do I still need to set the args.pretrained_dir parameter?

celbree commented 2 years ago

I wonder what is the difference between GPT-2 and CodeGPT in the table. As described in the readme, their model structure is the same. Why is the performance of the model different when the training data are the same?

The pre-training data are different. The GPT-2 model is not pre-trained on a code-domain dataset, while CodeGPT is pre-trained on CodeSearchNet.

It appears to be a distributed training exception. What should I do?

I'm not quite sure how to solve this error. As far as I know, if you want to run on a single node with 8 GPUs, you also need to set the param node_index to 0, because it is -1 by default in our code. If the error persists, maybe you can raise an issue on the PyTorch forum.
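For reference, a minimal illustration (not the repo's code) of when PyTorch raises this message: torch.cuda.set_device fails with "invalid device ordinal" whenever the requested index is not among the visible devices, e.g. if CUDA_VISIBLE_DEVICES exposes fewer GPUs than the launcher spawns workers. The rank value below is a hypothetical example.

import torch

local_rank = 7  # hypothetical rank handed out by the launcher
visible = torch.cuda.device_count()
if local_rank >= visible:
    # This is the situation that surfaces as "CUDA error: invalid device ordinal".
    raise RuntimeError(f"local_rank {local_rank}, but only {visible} GPU(s) are visible")
torch.cuda.set_device(local_rank)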

If I want to pre-train adaptedGPT2, I should set args.pretrained_dir with gpt2. If I want to pre-train CodeGPT from scratch, what should I do? Do I need to set the args.pretrained_dir parameter?

Just leave it as the default. The code will run into this branch (https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/CodeCompletion-token/code/run_lm.py#L679) and randomly initialize the model.
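For illustration, a minimal sketch (assuming the standard Hugging Face transformers classes, not the exact run_lm.py code) of the two initialization paths discussed above: loading the gpt2 checkpoint to continue pre-training, versus building a randomly initialized model of the same architecture from a config.

from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

# Continue pre-training from the OpenAI GPT-2 checkpoint (the adaptedGPT2 setting):
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Pre-train from scratch (the CodeGPT setting): only the architecture comes from
# the config; all weights are randomly initialized.
config = GPT2Config()                    # GPT-2 small architecture by default
scratch_model = GPT2LMHeadModel(config)  # no checkpoint is loaded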

skye95git commented 2 years ago

I wonder what is the difference between GPT-2 and CodeGPT in the table. As described in the readme, their model structure is the same. Why is the performance of the model different when the training data are the same?

The pre-training data are different. The GPT-2 model is not pre-trained on a code-domain dataset, while CodeGPT is pre-trained on CodeSearchNet.

It appears to be a distributed training exception. What should I do?

I'm not quite sure how to solve this error. As far as I know, if you want to run on a single node with 8 GPUs, you also need to set the param node_index to 0, because it is -1 by default in our code. If the error persists, maybe you can raise an issue on the PyTorch forum.

If I want to pre-train adaptedGPT2, I should set args.pretrained_dir with gpt2. If I want to pre-train CodeGPT from scratch, what should I do? Do I need to set the args.pretrained_dir parameter?

Just leave it as the default. The code will run into this branch (https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/CodeCompletion-token/code/run_lm.py#L679) and randomly initialize the model.

Thanks for the details. For pre-training CodeGPT from scratch, are the hyper-parameters (number of epochs, learning rate, per_gpu_train_batch_size, weight decay, gradient accumulation steps, optimizer) the same as the ones used in the adaptedGPT2 pre-training script, or as in the fine-tuning script?

skye95git commented 2 years ago

I have finished pre-training and fine-tuning adaptedGPT2. When I run evaluation on the dev set on a single GPU:

export CUDA_VISIBLE_DEVICES=1
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode/checkpoint-60000
PRETRAINDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode/checkpoint-60000-2.9975
LOGFILE=text2code_concode_eval_1.log

python -u run.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=512 \
        --do_eval \
        --logging_steps=100 \
        --seed=42

there is an error:

Traceback (most recent call last):
  File "run.py", line 653, in <module>
    main()
  File "run.py", line 644, in main
    dev_bleu, dev_EM = eval_bleu(args, model, tokenizer, file_type='dev', num=2000)
  File "run.py", line 357, in eval_bleu
    for step, (batch, token_labels) in enumerate(test_dataloader):
  File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/linjiayi/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
    return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType

It is very strange. In fact, I used the same script as above when I evaluated the model you provided, only with a different PRETRAINDIR parameter, and evaluation works fine with the pre-trained model you provide.

skye95git commented 2 years ago

When I pre-train CodeGPT from scratch, I leave the args.pretrained_dir parameter at its default, as you suggested. I run:

LANG=java                       
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/save/pretrain_GPT2
LOGFILE=CodeSearchNet_pretrain_GPT2_a100_8gpu.log
PER_NODE_GPU=8      

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm_pretrain.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=4e-4 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=8 \
        --per_gpu_eval_batch_size=8 \
        --gradient_accumulation_steps=4 \
        --num_train_epochs=20 \
        --logging_steps=100 \
        --save_steps=1000 \
        --overwrite_output_dir \
        --seed=42 \

There is an error:

Start slurm job at Wed 20 Oct 2021 02:13:13 PM CST
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
 Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : run_lm_pretrain.py
  min_nodes        : 1
  max_nodes        : 1
  nproc_per_node   : 8
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 127.0.0.1:29500
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/7/error.json
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -8, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -3, device: cuda:5, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -5, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -2, device: cuda:6, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -6, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -7, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -4, device: cuda:4, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ -   Process rank: -1, device: cuda:7, n_gpu: 1, distributed training: False, 16-bits training: False, world size: 1
Traceback (most recent call last):
  File "run_lm_pretrain.py", line 731, in <module>
    main()
  File "run_lm_pretrain.py", line 697, in main
    tokenizer = tokenizer_class.from_pretrained(args.tokenizer_dir, sep_token='<EOL>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', unk_token='<|UNKNOWN|>')
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1648, in from_pretrained
    pretrained_model_name_or_path, revision=revision, use_auth_token=use_auth_token
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 3408, in get_fast_tokenizer_file
    all_files = get_list_of_files(path_or_repo, revision=revision, use_auth_token=use_auth_token)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1687, in get_list_of_files
    path_or_repo, revision=revision, token=token
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/huggingface_hub/hf_api.py", line 248, in model_info
    r.raise_for_status()
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/models.py", line 953, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 7 (pid: 1383506) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/7/error.json
Traceback (most recent call last):
  File "run_lm_pretrain.py", line 731, in <module>
    main()
  File "run_lm_pretrain.py", line 629, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
[... the same store-based-barrier timeout traceback (worker_count=16) is raised on each of the other ranks ...]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1383747) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=2
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/7/error.json
Traceback (most recent call last):
  File "run_lm_pretrain.py", line 731, in <module>
    main()
  File "run_lm_pretrain.py", line 629, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
[... the same store-based-barrier timeout traceback (worker_count=24) is raised on each of the other ranks ...]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1411598) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=3
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
  role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
  global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/7/error.json
Traceback (most recent call last):
  File "run_lm_pretrain.py", line 731, in <module>
    main()
  File "run_lm_pretrain.py", line 629, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
    rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
[... the same store-based-barrier timeout traceback (worker_count=32) is raised on each of the other ranks ...]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1439177) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
  "This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0003800392150878906 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1439177", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "1439178", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "1439179", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "1439181", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 4, "group_rank": 0, "worker_id": "1439185", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [4], \"role_rank\": [4], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 5, "group_rank": 0, "worker_id": "1439187", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [5], \"role_rank\": [5], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 6, "group_rank": 0, "worker_id": "1439190", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [6], \"role_rank\": [6], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 7, "group_rank": 0, "worker_id": "1439192", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [7], \"role_rank\": [7], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "dgx018.scc.idea", "state": "SUCCEEDED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning: 

**********************************************************************
               CHILD PROCESS FAILED WITH NO ERROR_FILE                
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1439177 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

  from torch.distributed.elastic.multiprocessing.errors import record

  @record
  def trainer_main(args):
     # do train
**********************************************************************
  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
***************************************
       run_lm_pretrain.py FAILED       
=======================================
Root Cause:
[0]:
  time: 2021-10-20_15:44:16
  rank: 0 (local_rank: 0)
  exitcode: 1 (pid: 1439177)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
  time: 2021-10-20_15:44:16
  rank: 1 (local_rank: 1)
  exitcode: 1 (pid: 1439178)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[2]:
  time: 2021-10-20_15:44:16
  rank: 2 (local_rank: 2)
  exitcode: 1 (pid: 1439179)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[3]:
  time: 2021-10-20_15:44:16
  rank: 3 (local_rank: 3)
  exitcode: 1 (pid: 1439181)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[4]:
  time: 2021-10-20_15:44:16
  rank: 4 (local_rank: 4)
  exitcode: 1 (pid: 1439185)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[5]:
  time: 2021-10-20_15:44:16
  rank: 5 (local_rank: 5)
  exitcode: 1 (pid: 1439187)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[6]:
  time: 2021-10-20_15:44:16
  rank: 6 (local_rank: 6)
  exitcode: 1 (pid: 1439190)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
[7]:
  time: 2021-10-20_15:44:16
  rank: 7 (local_rank: 7)
  exitcode: 1 (pid: 1439192)
  error_file: <N/A>
  msg: "Process failed with exitcode 1"
***************************************

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
End slurm job at Wed Oct 20 15:44:16 CST 2021

The errors seem to be requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None and RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00).
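
Regarding the 404 on https://huggingface.co/api/models/None: it suggests a model identifier of None reached from_pretrained. A tiny, hypothetical guard makes that failure mode obvious earlier; the argument name below is illustrative and not necessarily the one used in run_lm_pretrain.py.

from transformers import GPT2Tokenizer

def load_tokenizer(tokenizer_dir):
    # If the directory/model id is unset, fail loudly here instead of letting a
    # lookup for "None" reach the Hugging Face hub and come back as a 404.
    if tokenizer_dir is None:
        raise ValueError("tokenizer_dir is not set; pass e.g. 'microsoft/CodeGPT-small-java'")
    return GPT2Tokenizer.from_pretrained(tokenizer_dir)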

I pre-trained adaptedGPT2 with the same script, just setting args.pretrain_dir to gpt2:

LANG=java                       
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/pretrain_adaptedGPT2
PRETRAINDIR=gpt2 
LOGFILE=CodeSearchNet_pretrain_adaptedGPT2_a100_8gpu.log
PER_NODE_GPU=8      

python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm_pretrain.py \
        --data_dir=$DATADIR \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=4e-4 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=8 \
        --per_gpu_eval_batch_size=8 \
        --gradient_accumulation_steps=4 \
        --num_train_epochs=20 \
        --logging_steps=100 \
        --save_steps=1000 \
        --overwrite_output_dir \
        --seed=42

When I pre-trained adaptedGPT2, I did not encounter the error of process group initialization failure.

celbree commented 2 years ago

It is very strange. In fact, I used the same script as above when I evaluated with the model you provided, but with a different PRETRAINDIR parameter. It can be evaluated using the pre-trained model you provide.

For this error, one possible reason is a difference in tokenizers. For example, if your tokenizer doesn't have a special token named bos_token, then calling tokenizer.bos_token_id will return None. Considering that the tokenizers used for pre-training and for fine-tuning on CONCODE are different, you may want to double-check whether you have added the special tokens when fine-tuning.
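
As a concrete illustration of that check, the snippet below is a minimal sketch, not the CodeXGLUE fine-tuning code itself: it loads a GPT-2-style checkpoint, adds the special tokens the CONCODE setup expects, and resizes the embeddings. The checkpoint path and token names are assumptions taken from this thread and should be confirmed against run.py.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Checkpoint is illustrative; point it at your own pre-trained model instead.
ckpt = "microsoft/CodeGPT-small-java-adaptedGPT2"
tokenizer = GPT2Tokenizer.from_pretrained(ckpt)
model = GPT2LMHeadModel.from_pretrained(ckpt)

# Make sure the special tokens used during CONCODE fine-tuning exist;
# if one is missing, e.g. tokenizer.bos_token_id silently returns None.
num_added = tokenizer.add_special_tokens({
    "bos_token": "<s>",
    "eos_token": "</s>",
    "pad_token": "<pad>",
    "sep_token": "concode_elem_sep",
})
if num_added > 0:
    # Grow the embedding matrix so the newly added ids have rows.
    model.resize_token_embeddings(len(tokenizer))

print(tokenizer.bos_token_id, tokenizer.sep_token_id)  # neither should be None now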

The errors seem to be requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None and RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00).

To pre-train CodeGPT from scratch, you need to feed args.tokenizer_dir and args.config_dir with microsoft/CodeGPT-xx so that the tokenizer can be loaded in this code. I also notice that you didn't set the node_index param. When running on a single machine, please set node_index to 0; otherwise local_rank will be negative.
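
A minimal sketch of that "from scratch" setup, as an illustration only and not the exact run_lm_pretrain.py code path: reuse the CodeGPT tokenizer and config, but build the model from the config so the weights are randomly initialized rather than loaded from a checkpoint.

from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

name = "microsoft/CodeGPT-small-java"
tokenizer = GPT2Tokenizer.from_pretrained(name)  # what args.tokenizer_dir would point to
config = GPT2Config.from_pretrained(name)        # what args.config_dir would point to
model = GPT2LMHeadModel(config)                  # random init, no pre-trained weights loaded
model.resize_token_embeddings(len(tokenizer))    # align embeddings with the tokenizer's vocab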

skye95git commented 2 years ago

@celbree Thanks for your reply!

  1. If the pretrained and fine-tuned tokenizers are different, can fine-tuning also succeed?

    Considering that the tokenizers used for pre-training and for fine-tuning on CONCODE are different.

I have done pre-training and fine-tuning. Errors are reported only when evaluating. Pre-training result:

CodeGPT_img_10

fine-tune result:

CodeGPT_img_12

An evaluation error was reported on the dev set:

RuntimeError: Could not infer dtype of NoneType
  2. Why are args.tokenizer_dir and args.config_dir set to microsoft/CodeGPT-small-java instead of gpt2? Is it because the gpt2 vocabulary was built on natural-language text, while the microsoft/CodeGPT-small-java vocabulary was built on code? I looked it up on the Internet:

    # OpenAI GPT-2
    from transformers import GPT2Tokenizer, GPT2Model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    model = GPT2Model.from_pretrained('gpt2')
  3. For pre-training CodeGPT from scratch, are the hyper-parameters (number of epochs, learning rate, per_gpu_train_batch_size, weight decay, gradient accumulation steps, optimizer) the same as those used in the adaptedGPT2 pre-training script, or as those in the fine-tuning script?

celbree commented 2 years ago
  1. The tokenizer is almost the same, but with a slight difference: we add several special tokens like concode_elem_sep, which makes the vocab size during fine-tuning larger than during pre-training. About the NoneType error, it might be because the tokenizer returns None when you call a non-existent special token, as I said before. So could you please try printing the token ids when the error happens?

  2. Because CodeGPT and CodeGPT-adapted use totally different tokenizers. We trained a new BPE tokenizer on the code domain for pre-training CodeGPT, but for CodeGPT-adapted we simply reuse the tokenizer from OpenAI GPT-2. So if you want to pre-train CodeGPT from scratch, you can either use our trained BPE tokenizer by setting microsoft/CodeGPT-small-xx or train a new tokenizer yourself (a minimal tokenizer-training sketch follows this list).

  3. We increase the number of epochs and keep other hyper-parameters the same.
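
For the "train a new tokenizer yourself" option in point 2, one possible approach is shown below using the Hugging Face tokenizers library. This is only an illustration under assumed file paths, vocab size, and special tokens, not the exact procedure used to build the CodeGPT tokenizer.

import os
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a code corpus (one sample per line).
# The corpus file name and settings here are assumptions for this sketch.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["java_train.txt"],
    vocab_size=50000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<EOL>"],
)
os.makedirs("code_bpe_tokenizer", exist_ok=True)
tokenizer.save_model("code_bpe_tokenizer")  # writes vocab.json and merges.txt

The resulting vocab.json and merges.txt can then be loaded with GPT2Tokenizer.from_pretrained by pointing args.tokenizer_dir at that directory.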

skye95git commented 2 years ago
  1. The tokenizer is almost the same, but with a slight difference: we add several special tokens like concode_elem_sep, which makes the vocab size during fine-tuning larger than during pre-training. About the NoneType error, it might be because the tokenizer returns None when you call a non-existent special token, as I said before. So could you please try printing the token ids when the error happens?
  2. Because CodeGPT and CodeGPT-adapted use totally different tokenizers. We trained a new BPE tokenizer on the code domain for pre-training CodeGPT, but for CodeGPT-adapted we simply reuse the tokenizer from OpenAI GPT-2. So if you want to pre-train CodeGPT from scratch, you can either use our trained BPE tokenizer by setting microsoft/CodeGPT-small-xx or train a new tokenizer yourself.
  3. We increase the number of epochs and keep other hyper-parameters the same.

Thanks for your reply! After I manually added "concode_elem_sep": 50261 to added_tokens.json, the Could not infer dtype of NoneType problem was solved. I have a few questions:

  1. After fine-tuning the pre-trained model microsoft/CodeGPT-small-java-adaptedGPT2 you provided, the resulting added_tokens.json file is as follows:
    {"<pad>": 50257, "</s>": 50258, "<s>": 50259, "concode_elem_sep": 50260}

After fine-tuning my own pre-trained model adaptedGPT2, the resulting file added_tokens.json is as follows:

{"<s>": 50257, "</s>": 50258, "<EOL>": 50259, "<pad>": 50260}

The special token concode_elem_sep is missing from my result, even though I set PRETRAINDIR to a checkpoint generated by pre-training:

PRETRAINDIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/save/pretrain_adaptedGPT2/checkpoint-last

I also added concode_elem_sep to the tokenizer as a special token: https://github.com/microsoft/CodeXGLUE/blob/ae1d06f5505b3f71b6e1be36ee26028f17c09994/Text-Code/text-to-code/code/run.py#L613

I ran the following code:

if args.do_eval:            # only works on 1 GPU
    print(tokenizer.bos_token)
    print(tokenizer.sep_token)
    print(tokenizer.encode('<s>'))
    print(tokenizer.encode('concode_elem_sep', add_special_tokens=True))

The result is:

<s>
concode_elem_sep
[50257]
[None]

Why is there no concode_elem_sep in the added_tokens.json file generated after fine-tuning my own pre-trained model? Is it appropriate to manually add concode_elem_sep to the added_tokens.json file as I described above?

  2. We trained a new BPE tokenizer on the code domain for pre-training CodeGPT. So if you want to pre-train CodeGPT from scratch, you can either use our trained BPE tokenizer by setting microsoft/CodeGPT-small-xx or train a new tokenizer yourself.

I downloaded the vocab.json from microsoft/CodeGPT-small-java and found that <s>, <pad>, and </s> are already stored in it:

"<s>": 0,
"<pad>": 1,
"</s>": 2,

Why do I still need to add these special tokens when I'm pre-training from scratch? https://github.com/microsoft/CodeXGLUE/blob/ae1d06f5505b3f71b6e1be36ee26028f17c09994/Text-Code/text-to-code/code/run.py#L613

  3. How do I train a new BPE tokenizer on the code domain? If I pre-train on different programming languages, do I need to retrain the tokenizer?

  4. I pre-trained adaptedGPT2 and got the following evaluation results:

Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
CodeGPT (yours) | 18.25 | 28.69 | 32.71
CodeGPT-adapted (yours) | 20.10 | 32.79 | 35.98
CodeGPT-adapted (I pretrained) | 5.2 | 32.45 | 35.52

The BLEU and CodeBLEU scores are similar to the results you provided. Why is EM so different from yours?

  5. I found that the predictions of CodeGPT and CodeGPT-adapted are at the token level:
    void function ( ) { TSTNode loc0 = root ; while ( loc0 != null ) { if ( loc0 . data . equals ( data ) ) return ; loc0 = loc0 . left ; } }

    The generated code is not very readable. Is this because of the data format of the CONCODE dataset used for fine-tuning? If I want to generate an executable and more readable code snippet like the following, what should I do? (A rough re-spacing sketch follows below.)

    捕获
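
For the readability question in point 5, one very rough, purely illustrative option is to re-space the token-level output before viewing it or feeding it to a real Java formatter. The regexes below are a sketch, not part of CodeXGLUE, and do not follow Java style conventions exactly.

import re

def respace_tokens(tokens: str) -> str:
    """Roughly re-space CONCODE-style token-level output; a real formatter
    (e.g. google-java-format) would do a much better job."""
    code = re.sub(r"\s+([.,;)\]])", r"\1", tokens)  # drop spaces before . , ; ) ]
    code = re.sub(r"([.(\[])\s+", r"\1", code)      # drop spaces after . ( [
    code = re.sub(r"\s+\(", "(", code)              # tighten space before (
    return code

print(respace_tokens(
    "void function ( ) { TSTNode loc0 = root ; while ( loc0 != null ) { "
    "if ( loc0 . data . equals ( data ) ) return ; loc0 = loc0 . left ; } }"
))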
skye95git commented 2 years ago

It is very strange. I pre-trained CodeGPT-adapted and CodeGPT. Every time I fine-tune them, I get different evaluation results, even though I set the same parameters for each fine-tuning run.

Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
CodeGPT (yours) | 18.25 | 28.69 | 32.71
CodeGPT (I pretrained one) | 8.95 | 28.81 | 33.05
CodeGPT-adapted (yours) | 20.10 | 32.79 | 35.98
CodeGPT-adapted (I pretrained one) | 5.2 | 32.45 | 35.52
CodeGPT-adapted (I pretrained two) | 2.5 | 27.41 | 36.47