Closed skye95git closed 2 years ago
I see that CodeGPT can be used for both code completion and code generation. Is there any difference between the CodeGPT models used in the two tasks? Is it a single pre-trained model that can be used for both?
When I try to fine-tune CodeGPT on the code generation task, I get an error:
I have tried several times and hit the same error each time. What should I do? Is it a data download problem?
Hi @skye95git, CodeGPT and CodeGPT-adapted are both pre-trained with causal language modeling as the pre-training task, as in GPT-2. Since that is also the training objective of the code completion task, you can use the code for the code completion task to re-pretrain or continue pre-training the model. The model can be used for both code completion and code generation because both are autoregressive generation tasks. As for the error, I think it is due to the download speed: according to the first line of the log, it is a timeout error. You can try downloading the model on a faster network.
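To illustrate why one causal language model can serve both tasks, here is a minimal sketch of the next-token objective. It only shows how a token sequence is turned into (context, next token) training pairs; the function name `next_token_pairs` and the toy token ids are assumptions made for illustration, not code from the repository.

```python
from typing import List, Tuple


def next_token_pairs(token_ids: List[int]) -> List[Tuple[List[int], int]]:
    """Return (context, target) pairs for left-to-right language modeling.

    Each position is predicted from all tokens before it, which is the same
    setup whether the prefix is partial code (completion) or a natural
    language description followed by code (generation).
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


# Toy ids only; a real run would use ids produced by the model's tokenizer.
pairs = next_token_pairs([10, 11, 12, 13])
for context, target in pairs:
    print(context, "->", target)
```

In both tasks the model is trained and queried the same way; only the content of the prefix differs.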
Thanks for your reply! Yes, the error was due to the download speed. I have finished fine-tuning and evaluating CodeGPT on the code generation task:
But the inference result differs from the result in the README. Why is the result zero? Did I do something wrong?
We don't provide the test set ground truth, which is why the result is zero. You can generate your predictions and submit them to codexglue@microsoft.com, and we will send your results back.
Hi, I calculated the CodeBLEU score on the code generation task with the script in CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/. I used the CONCODE test set and ran:
python calc_code_bleu.py --refs /CodeXGLUE/Text-Code/text-to-code/evaluator/test.json --hyp /CodeXGLUE/Text-Code/text-to-code/evaluator/test.txt --lang java --params 0.25,0.25,0.25,0.25
The test.json is the CONCODE test set, and test.txt is the corresponding predicted result. There is an error: ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?
I have another question: Can I pass the CONCODE test set directly to the --refs parameter?
parser.add_argument('--refs', type=str, nargs='+', required=True,
                    help='reference files')
parser.add_argument('--hyp', type=str, required=True,
                    help='hypothesis file')
parser.add_argument('--lang', type=str, required=True,
                    choices=['java','js','c_sharp','php','go','python','ruby'],
                    help='programming language')
parser.add_argument('--params', type=str, default='0.25,0.25,0.25,0.25',
                    help='alpha, beta and gamma')
Hi, I tried to fine-tune CodeGPT with the pre-trained model CodeGPT-small-java as follows:
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode_CodeGPT
PRETRAINDIR=microsoft/CodeGPT-small-java
LOGFILE=text2code_concode_CodeGPT.log
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=512 \
--do_train \
--node_index 0 \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=6 \
--per_gpu_eval_batch_size=12 \
--gradient_accumulation_steps=2 \
--num_train_epochs=30 \
--logging_steps=100 \
--save_steps=5000 \
--overwrite_output_dir \
--seed=42
I've tried it a few times. There is an error:
10/09/2021 16:41:09 - INFO - __main__ - [0, 36667, 12023, 1, 2]
10/09/2021 16:41:11 - INFO - filelock - Lock 139879773943904 acquired on /home/linjiayi/.cache/huggingface/transformers/33595bb220c9f28a0b5f118f74b92e9452ea8b2d57f95ff63ead768fd6d78fe7.370b83843c894ed8a095a0d4746bed76f0357559edebf4023f38652b971ca917.lock
10/09/2021 16:41:21 - INFO - filelock - Lock 139879773943904 released on /home/linjiayi/.cache/huggingface/transformers/33595bb220c9f28a0b5f118f74b92e9452ea8b2d57f95ff63ead768fd6d78fe7.370b83843c894ed8a095a0d4746bed76f0357559edebf4023f38652b971ca917.lock
HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 170, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/util/connection.py", line 73, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/socket.py", line 745, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
chunked=chunked,
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 353, in connect
conn = self._new_conn()
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connection.py", line 182, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1257, in from_pretrained
user_agent=user_agent,
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1371, in cached_path
local_files_only=local_files_only,
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1626, in get_from_cache
http_get(url_to_download, temp_file, proxies=proxies, resume_size=resume_size, headers=headers)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1473, in http_get
r = requests.get(url, stream=True, proxies=proxies, headers=headers)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/CodeGPT-small-java/1ce001a39943be8dc0ff6cf1ebd407608deb96b9b0a9bd1b45d786b1fc88e8ef (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f38288e2908>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "run.py", line 653, in <module>
main()
File "run.py", line 615, in main
model = model_class.from_pretrained(pretrained)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1266, in from_pretrained
raise EnvironmentError(msg)
OSError: Can't load weights for 'microsoft/CodeGPT-small-java'. Make sure that:
- 'microsoft/CodeGPT-small-java' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'microsoft/CodeGPT-small-java' is the correct path to a directory containing a file named one of pytorch_model.bin, tf_model.h5, model.ckpt.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 152570) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_mj9tvknq/none_b3oaj556/attempt_1/7/error.json
local_rank: 2, node_index: 0, gpu_per_node: 8
local_rank: 1, node_index: 0, gpu_per_node: 8
local_rank: 5, node_index: 0, gpu_per_node: 8
local_rank: 7, node_index: 0, gpu_per_node: 8
local_rank: 0, node_index: 0, gpu_per_node: 8
local_rank: 6, node_index: 0, gpu_per_node: 8
local_rank: 3, node_index: 0, gpu_per_node: 8
local_rank: 4, node_index: 0, gpu_per_node: 8
The task is stuck at local_rank: 4, node_index: 0, gpu_per_node: 8. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune successfully. What should I do?
I want to re-pretrain CodeGPT for the code generation task from scratch. The description in CodeXGLUE/Code-Code/CodeCompletion-token/README.md says:
We pre-train monolingual models respectively on Python and Java corpus from the CodeSearchNet dataset, which includes 1.1M Python functions and 1.6M Java methods. A function or method in training dataset consists function signature and function body. Some functions also contain NL docstrings.
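I have a couple of questions about the pre-training data:
- Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
- What is the pre-training input data format? Can you give me an example?
- Can the training code in run_lm.py be used for pre-training?
The test.json is CONCODE test set, and the test.txt is the corresponding predicted result. There is an error ValueError: Incompatible Language version 11. Must be between 13 and 13. What should I do?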
You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.
I have another question: Can I pass the Concode test set directly into the --refs parameter?
I don't think so. You need to feed a .txt file for refs.
The task is stuck at local_rank: 4, node_index: 0, gpu_per_node: 8. Is it because the microsoft/CodeGPT-small-java model was not downloaded successfully? I only changed PRETRAINDIR. When I use microsoft/CodeGPT-small-java-adaptedGPT2, I can fine-tune successfully. What should I do?
I think it is a Hugging Face transformers issue. But you can just use CodeGPT-small-java-adaptedGPT2, which performs better than CodeGPT-small-java.
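The traceback above bottoms out in `socket.getaddrinfo` with "[Errno -3] Temporary failure in name resolution". One way to check whether the training machine can resolve the host named in the log is the standard-library sketch below. The helper name `can_resolve` is an assumption for illustration; it only tests DNS resolution and says nothing about download speed or the transformers library itself.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        # Same exception class that appears at the bottom of the traceback.
        return False
```

Calling `can_resolve("cdn-lfs.huggingface.co")` on the training machine returns False when name resolution fails there. In that case, downloading the model files on another machine and pointing PRETRAINDIR at the local directory is one possible workaround.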
- Do you use the Python and Java corpora directly from the CodeSearchNet dataset? Do I need to preprocess the CodeSearchNet dataset? What else do I need to do besides adding <s> and </s> at the beginning and end of the source code?
- What is the pre-trained input data format? Can you give me an example?
- Can the training code in run_lm.py be used for pre-training?
- We removed the NL comments in the source code and ignored line breaks and indentation when preprocessing.
- The data format is just the code itself, e.g. <s> def function(a=0): return a </s>
- Yes. We used run_lm.py for pre-training, since the code completion task shares the same training objective as causal language model training. We used TextDataset to load the pre-training corpus.
Thanks for your reply! I have a few questions:
<s> def function(a=0): return a </s>
The data format of CodeSearchNet dataset is as follows:
"original_string": "def get_vid_from_url(url):\n \"\"\"Extracts video ID from URL.\n \"\"\"\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n parse_query_param(url, 'v') or \\\n parse_query_param(parse_query_param(url, 'u'), 'v')"
According to the example you gave, is it necessary to remove the function name like concode during preprocessing?
We pre-train monolingual models respectively on Python and Java corpus from the CodeSearchNet dataset, which includes 1.1M Python functions and 1.6M Java methods. To fine-tune CodeGPT on concode dataset for text2code generation on multi-GPUs at a single machine.
The concode dataset used for fine-tuning contains only the Java corpus. Can the pre-trained model CodeGPT-small-py-adaptedGPT2 generate both Python and Java code, or only Java code, after fine-tuning on the Java corpus?
Can run_lm.py only be used to pre-train CodeGPT, or can it pre-train both CodeGPT and CodeGPT-adapted?
What should I be aware of if I re-pretrain CodeGPT-adapted? Do I need to load GPT-2 as a starting point?
- No. Sorry for the misleading example. We don't remove any identifiers.
- CodeGPT was pre-trained on monolingual data, so CodeGPT-small-py can only be used for Python data.
- It can be used for both. By setting args.pretrained_dir to gpt2, you can continue pre-training a model initialized with the OpenAI GPT-2 checkpoint.
- Yes. Set args.pretrained_dir to gpt2.
For the second answer: the dataset used to fine-tune and evaluate the code generation model is CONCODE, which contains only the Java corpus. So can CodeGPT-small-py generate Python code without fine-tuning on the CONCODE dataset? How can I evaluate the performance of CodeGPT-small-py?
We removed the NL comments in source codes and ignored line-breaks and indent when preprocessing. The data format is just the code itself.
Do single-line comments, multi-line comments, and document comments need to be removed from the source code?
Why do you remove the NL comments in source codes?
Does ignoring line-breaks and indent mean removing them? If yes, why do you remove them?
We remove all the comments in the source code because we want to pre-train a model focused on the code domain. As for line breaks and indentation, we remove them to make the input sequence shorter. We know they may convey useful information, but in our experiments we found that we don't need to preserve them on purpose, as the pre-trained model can easily learn them from code.
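The preprocessing described above (drop comments, drop line breaks and indentation, wrap in <s> and </s>) can be sketched for Python source with the standard-library `tokenize` module. This is not the maintainers' script: the function name `flatten_python` is hypothetical, docstrings are treated as comments by assumption, and the output joins tokens with single spaces, so its spacing differs from the original source text.

```python
import io
import tokenize

# Token types that only describe layout and carry no code content.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER}


def flatten_python(source: str) -> str:
    """Strip comments and docstrings, flatten to one line, add <s> ... </s>."""
    # Drop comments and the NL tokens of blank, comment-only or bracketed lines.
    toks = [
        t for t in tokenize.generate_tokens(io.StringIO(source).readline)
        if t.type not in (tokenize.COMMENT, tokenize.NL)
    ]
    kept = []
    for i, tok in enumerate(toks):
        if tok.type in _LAYOUT:
            continue
        if tok.type == tokenize.STRING:
            starts_stmt = i == 0 or toks[i - 1].type in _LAYOUT
            ends_stmt = i + 1 < len(toks) and toks[i + 1].type in (
                tokenize.NEWLINE, tokenize.ENDMARKER)
            if starts_stmt and ends_stmt:
                continue  # a bare string statement, treated here as a docstring
        kept.append(tok.string)
    return "<s> " + " ".join(kept) + " </s>"


example = 'def f(a=0):\n    """Doc."""\n    # comment\n    return a  # trailing\n'
print(flatten_python(example))
```

Backslash line continuations produce no tokens, so they disappear along with the line breaks. Java sources would need a different tokenizer.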
You can try to re-build my-languages.so by running build.sh here using the current version of tree-sitter.
After I re-built my-languages.so by running build.sh, I ran:
python calc_code_bleu.py --refs /home/linjiayi/CodeXGLUE/Text-Code/text-to-code/evaluator/test.json --hyp /home/linjiayi/CodeXGLUE/Text-Code/text-to-code/evaluator/test.txt --lang java --params 0.25,0.25,0.25,0.25
The result is
WARNING: There is no reference data-flows extracted from the whole corpus, and the data-flow match score degenerates to 0. Please consider ignoring this score.
ngram match: 0.00542230945681549, weighted ngram match: 0.07657225779288454, syntax_match: 0.0, dataflow_match: 0
CodeBLEU score: 0.020498641812425007
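The printed score can be checked by hand. The sketch below assumes CodeBLEU is the weighted sum of the four component scores using the four comma-separated values of --params; the function name `code_bleu` and the weight names are assumptions, chosen because this reading reproduces the number in the log.

```python
import math


def code_bleu(ngram, weighted_ngram, syntax, dataflow,
              params="0.25,0.25,0.25,0.25"):
    """Combine the four component scores with the --params weights."""
    alpha, beta, gamma, theta = (float(x) for x in params.split(","))
    return (alpha * ngram + beta * weighted_ngram
            + gamma * syntax + theta * dataflow)


# Component scores copied from the log output above.
score = code_bleu(0.00542230945681549, 0.07657225779288454, 0.0, 0.0)
print(score)
```

Under this reading, the syntax and data-flow terms contribute nothing here, which suggests the reference file could not be parsed as Java code.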
I set the --refs parameter to the test.json of concode, and the --hyp parameter to CodeGPT's generated result on the test set. The results above look abnormal, presumably because my input parameters are wrong.
I don't think so. You need feed a .txt file for refs.
Should the --refs parameter be set to a .txt file? What is the content of the TXT file? Is it a reference value? What does refs mean?
There is only one code for each NL in the Concode dataset:
{"code": "void function ( ScriptOrFnNode arg0 ) { int loc0 = - 1 ; collectFuncNodes ( arg0 , loc0 , null ) ; }", "nl": "generate mappings for each function node and parameters and variables names associated with it . concode_field_sep int parentScope concode_elem_sep ArrayList functionBracePositions concode_elem_sep ObjArray funcObjects concode_elem_sep int functionNum concode_elem_sep ArrayList functionVarMappings concode_elem_sep int lastTokenCount concode_elem_sep ArrayList replacedTokens concode_field_sep boolean isInScopeChain concode_elem_sep void reset concode_elem_sep void leaveNestingLevel concode_elem_sep String getMappedToken concode_elem_sep String getPreviousTokenMapping concode_elem_sep void collectFuncNodes concode_elem_sep int sourceCompress concode_elem_sep void enterNestingLevel"}
Where does the reference code list come from? Or are the contents of the TXT file the code paired with each NL in the CONCODE test set?
I have solved it. The --refs parameter should be set to a file that holds the code field of the test set.
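A minimal sketch of building such a reference file, assuming the input is in the JSON-lines layout shown above (one object per line with a populated `code` field). The function name `write_refs` and the file names are hypothetical.

```python
import json


def write_refs(json_path: str, refs_path: str) -> int:
    """Write the `code` field of each JSON line to a one-reference-per-line file."""
    count = 0
    with open(json_path, encoding="utf-8") as src, \
            open(refs_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            code = json.loads(line)["code"]
            # Keep each reference on a single line so it aligns with the hypothesis file.
            dst.write(code.replace("\n", " ") + "\n")
            count += 1
    return count
```

The resulting file can then be passed to --refs, with the predictions file (one prediction per line, in the same order) passed to --hyp.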
We remove all the comments in source codes as we want to pre-train a model focusing on code domain. About line-breaks and indent, we remove them to make input sequence shorter. we know that it may convey useful information but in our experiments, we found that we don't need preserve them in purpose as the pre-trained model can easily learn them from codes.
Thanks for your reply! So before pre-training, I need to preprocess the following code field of the CodeSearchNet data:
"def get_vid_from_url(url):\n \"\"\"Extracts video ID from URL.\n \"\"\"\n return match1(url, r'youtu\\.be/([^?/]+)') or \\\n match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n parse_query_param(url, 'v') or \\\n parse_query_param(parse_query_param(url, 'u'), 'v')"
According to the following rules:
- Add <s> and </s> at the beginning and end of the source code
The result after preprocessing is as follows:
"<s> def get_vid_from_url(url): return match1(url, r'youtu\\.be/([^?/]+)') or match1(url, r'youtube\\.com/embed/([^/?]+)') or match1(url, r'youtube\\.com/v/([^/?]+)') or match1(url, r'youtube\\.com/watch/([^/?]+)') or parse_query_param(url, 'v') or parse_query_param(parse_query_param(url, 'u'), 'v') </s>"
I save the preprocessing result as pre_train.jsonl. Should the --data_dir parameter of run_lm.py be set to pre_train.jsonl?
The data_dir should be set to the directory that contains pre_train.jsonl. I think you also need to modify TextDataset to load data from the file as you wish.
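As a rough idea of what a modified loader might do, the sketch below reads a JSON-lines file, encodes every sample, and packs the ids into fixed-size blocks. It is an assumption-laden illustration, not the repository's TextDataset: `pack_blocks` is a hypothetical name, each line is assumed to hold one JSON-encoded string like the preprocessed example above, and `encode` stands in for whatever tokenizer is used.

```python
import json
from typing import Callable, Iterable, List


def pack_blocks(lines: Iterable[str],
                encode: Callable[[str], List[int]],
                block_size: int) -> List[List[int]]:
    """Concatenate the token ids of all samples and cut them into full blocks."""
    ids: List[int] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        text = json.loads(line)  # assumed layout: one JSON string per line
        ids.extend(encode(text))
    # Keep only complete blocks; the trailing partial block is dropped.
    return [ids[i:i + block_size]
            for i in range(0, len(ids) - block_size + 1, block_size)]


# Toy encoder for demonstration only: one id per whitespace-separated token.
toy_encode = lambda s: [len(w) for w in s.split()]
blocks = pack_blocks(['"<s> a bb </s>"', '"<s> ccc </s>"'], toy_encode, 3)
print(blocks)
```

With a Hugging Face tokenizer, `tokenizer.encode` could be passed as `encode`, and the returned blocks wrapped in a dataset class of your choice.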
Thanks for your reply! I will try it. I followed the README steps to reproduce Text2Code Generation. I fine-tuned with the following command:
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode
PRETRAINDIR=microsoft/CodeGPT-small-java-adaptedGPT2
LOGFILE=text2code_concode.log
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=512 \
--do_train \
--node_index 0 \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=6 \
--per_gpu_eval_batch_size=12 \
--gradient_accumulation_steps=2 \
--num_train_epochs=30 \
--logging_steps=100 \
--save_steps=5000 \
--overwrite_output_dir \
--seed=42
After fine-tuning I calculate BLEU and EM using the following command:
python evaluator/evaluator.py -a=evaluator/test.json -p=evaluator/test.txt
I calculate CodeBLEU with the following command:
python calc_code_bleu.py --refs /platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/CodeGPT-adapted/test.gold --hyp /platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/CodeGPT-adapted/test.output --lang java --params 0.25,0.25,0.25,0.25
Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
CodeGPT(yours) | 18.25 | 28.69 | 32.71
CodeGPT-adapted(yours) | 20.10 | 32.79 | 35.98
CodeGPT-adapted(mine) | 19.8 | 28.62 | 32.58
The evaluation results are shown in the table above. Compared with your CodeGPT-adapted results, mine are lower by 4.17 points on BLEU and 3.4 points on CodeBLEU. Was there something wrong with my fine-tuning?
We only evaluate perplexity in dev set during training, but there's a gap between ppl and BLEU. So I suggest you choose the checkpoint which has the highest BLEU score on dev set. It brings improvement in our experiment.
In fact, I chose the checkpoint that has the highest BLEU score on the test set.
On dev set:
Model | EM | BLEU
-- | -- | --
checkpoint-5000-1.9364 | 17.05 | 18.05
checkpoint-10000-2.0459 | 16.45 | 21.98
checkpoint-15000-2.1478 | 16.3 | 22.0
checkpoint-20000-2.2717 | 15.8 | 23.25
checkpoint-25000-2.3528 | 16.1 | 24.03
checkpoint-30000-2.3921 | 16.2 | **24.47**
checkpoint-last | 16.2 | **24.47**
On test set:
Compared:
Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
CodeGPT(yours) | 18.25 | 28.69 | 32.71
CodeGPT-adapted(yours) | 20.10 | 32.79 | 35.98
CodeGPT-adapted(mine) | 19.8 | 28.62 | 32.60
There is still a gap between my results and yours.
It seems that the BLEU score on the dev set is still increasing, which indicates that training has not overfit yet. Continuing to train this model may bring further improvement.
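The checkpoint choice and the "still increasing" observation can be made explicit with a few lines of Python. The dev-set BLEU values below are the ones reported earlier in this thread; the helper names `best_checkpoint` and `still_improving` are hypothetical.

```python
# Dev-set BLEU per checkpoint, in training order (values from this thread).
dev_bleu = {
    "checkpoint-5000": 18.05,
    "checkpoint-10000": 21.98,
    "checkpoint-15000": 22.0,
    "checkpoint-20000": 23.25,
    "checkpoint-25000": 24.03,
    "checkpoint-30000": 24.47,
}


def best_checkpoint(scores: dict) -> str:
    """Return the checkpoint name with the highest dev-set BLEU."""
    return max(scores, key=scores.get)


def still_improving(scores: dict) -> bool:
    """True if the last checkpoint beats the one before it (dict keeps insertion order)."""
    values = list(scores.values())
    return len(values) >= 2 and values[-1] > values[-2]


print(best_checkpoint(dev_bleu), still_improving(dev_bleu))
```

When the best checkpoint is also the last one and BLEU is still rising, training longer is a reasonable next step before comparing with the published numbers.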
Thanks. I will try it.
Model | EM | BLEU | CodeBLEU
-- | -- | -- | --
GPT-2 | 17.35 | 25.37 | 29.69
CodeGPT | 18.25 | 28.69 | 32.71
CodeGPT shares the same model architecture and training object with GPT-2, consisting 12 layers of Transformer decoders.
I wonder what the difference is between GPT-2 and CodeGPT in the table. As described in the README, their model structure is the same. Why does their performance differ when the training data are the same?
The data_dir is set to the directory where pre_train.jsonl is in. And I think you need modify the TextDataset to load data from file as you wish.
I have modified TextDataset. Then I ran the following command to pre-train adaptedGPT2:
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/pretrain_adaptedGPT2
PRETRAINDIR=gpt2
LOGFILE=CodeSearchNet_pretrain_adaptedGPT2_a100_8gpu.log
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=1024 \
--do_train \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=4e-4 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=8 \
--per_gpu_eval_batch_size=8 \
--gradient_accumulation_steps=4 \
--num_train_epochs=20 \
--logging_steps=100 \
--save_steps=1000 \
--seed=42 \
--overwrite_output_dir
There is an error:
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_0/7/error.json
Traceback (most recent call last):
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
File "run_lm.py", line 717, in <module>
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
main()
File "run_lm.py", line 620, in main
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2097714) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_1/7/error.json
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2097920) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_2/7/error.json
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2098017) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_rcnx9u6d/none_omojhtq2/attempt_3/7/error.json
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Traceback (most recent call last):
File "run_lm.py", line 717, in <module>
main()
File "run_lm.py", line 620, in main
torch.cuda.set_device(args.local_rank)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 264, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2098190) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00029754638671875 seconds
{"name": "torchelastic.worker.status.TERMINATED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "2098189", "role": "default", "hostname": "dgx078.scc.idea", "state": "TERMINATED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "2098190", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "2098191", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "2098192", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 4, "group_rank": 0, "worker_id": "2098193", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [4], \"role_rank\": [4], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 5, "group_rank": 0, "worker_id": "2098194", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [5], \"role_rank\": [5], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 6, "group_rank": 0, "worker_id": "2098195", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [6], \"role_rank\": [6], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 7, "group_rank": 0, "worker_id": "2098196", "role": "default", "hostname": "dgx078.scc.idea", "state": "FAILED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [7], \"role_rank\": [7], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "dgx078.scc.idea", "state": "SUCCEEDED", "total_run_time": 21, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 2098190 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
main()
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
run_lm.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-10-18_09:48:38
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 2098190)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
time: 2021-10-18_09:48:38
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 2098191)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2021-10-18_09:48:38
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 2098192)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[3]:
time: 2021-10-18_09:48:38
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 2098193)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[4]:
time: 2021-10-18_09:48:38
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 2098194)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[5]:
time: 2021-10-18_09:48:38
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 2098195)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[6]:
time: 2021-10-18_09:48:38
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 2098196)
error_file: <N/A>
msg: "Process failed with exitcode 1"
***************************************
It appears to be a distributed training exception. What should I do?
It seems that the BLEU score on the dev set is still increasing, which indicates that the model hasn't overfit yet. Continuing to train this model may bring further improvements.
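A rough way to make the "still increasing" judgment concrete is to compare the best dev BLEU among the most recent checkpoints with the best score seen before them. The sketch below is only an illustration: the function name `still_improving`, the `patience` and `min_delta` parameters, and the score list are all hypothetical and are not part of the CodeXGLUE scripts or real results.

```python
def still_improving(dev_bleu, patience=3, min_delta=0.0):
    """Return True if the last `patience` evaluations beat the earlier best.

    dev_bleu: dev-set BLEU scores in evaluation order, one per checkpoint.
    """
    if len(dev_bleu) <= patience:
        # Too few evaluations to call a plateau; keep training.
        return True
    best_before = max(dev_bleu[:-patience])
    best_recent = max(dev_bleu[-patience:])
    return best_recent > best_before + min_delta


if __name__ == "__main__":
    # Hypothetical scores for illustration only (not real results).
    scores = [20.1, 23.4, 25.0, 26.2, 26.9]
    print(still_improving(scores))
```

If the function returns False for several evaluations in a row, raising --num_train_epochs further is unlikely to help much.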
After I set --num_train_epochs=60, I get the latest result:
- No. Sorry for the misleading example. We don't remove any identifiers.
- CodeGPT was pre-trained on monolingual data, so CodeGPT-small-py can only be used for Python data.
- It can be used for both. By setting args.pretrained_dir to gpt2, you can continue pre-training a model initialized from the OpenAI GPT-2 checkpoint.
- Yes. Set args.pretrained_dir to gpt2.
If I want to pre-train adaptedGPT2, I should set args.pretrained_dir with gpt2. If I want to pre-train CodeGPT from scratch, what should I do? Do I need to set the args.pretrained_dir parameter?
I wonder what is the difference between GPT-2 and CodeGPT in the table. As described in the readme, their model structure is the same. Why is the performance of the model different when the training data are the same?
The pre-training data are different. GPT-2 is not pre-trained on a code-domain dataset, whereas CodeGPT is pre-trained on CodeSearchNet.
It appears to be a distributed training exception. What should I do?
I'm not quite sure how to solve this error. As far as I know, if you want to run on a single node with 8 GPUs, you also need to set the parameter node_index to 0, because it is -1 by default in our code. If the error persists, you could raise an issue on the PyTorch forum.
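In general, CUDA reports "invalid device ordinal" when the requested device index is not below the number of GPUs visible to the process. The log above shows eight workers being launched (local_ranks 0 to 7) with local_rank 1 as the first failure, so one possible cause is that fewer GPUs are visible than workers were started, for example because of a restrictive CUDA_VISIBLE_DEVICES setting. This is an assumption, not something the thread confirms. A minimal sketch of a guard that makes such a mismatch explicit follows; the helper name `check_local_rank` is hypothetical and is not part of run_lm.py.

```python
def check_local_rank(local_rank, device_count):
    """Return local_rank if it addresses a visible GPU, otherwise raise."""
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"local_rank={local_rank} but only {device_count} GPU(s) are visible; "
            "launch at most that many processes per node or expose more GPUs"
        )
    return local_rank


# Intended use inside a training script (needs PyTorch, so left as a comment):
#   import torch
#   rank = check_local_rank(args.local_rank, torch.cuda.device_count())
#   torch.cuda.set_device(rank)

if __name__ == "__main__":
    print(check_local_rank(0, 1))
    try:
        check_local_rank(1, 1)
    except ValueError as err:
        print(err)
```

Comparing the number of processes requested from the launcher with the value of torch.cuda.device_count() on the machine is a quick first check before digging further.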
If I want to pre-train adaptedGPT2, I should set args.pretrained_dir with gpt2. If I want to pre-train CodeGPT from scratch, what should I do? Do I need to set the args.pretrained_dir parameter?
Just leave it as default. The code will run into this branch (https://github.com/microsoft/CodeXGLUE/blob/main/Code-Code/CodeCompletion-token/code/run_lm.py#L679) and randomly initialize the model.
Thanks for the details. For pre-training CodeGPT from scratch, are the hyper-parameters (number of epochs, learning rate, per_gpu_train_batch_size, weight decay, gradient accumulation steps, optimizer) the same as the ones used in the adaptedGPT2 pre-training scripts, or as the ones in the fine-tuning scripts?
I have finished pre-training and fine-tuning adaptedGPT2. When I run evaluation on the dev set on a single GPU:
export CUDA_VISIBLE_DEVICES=1
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/concode
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode/checkpoint-60000
PRETRAINDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/concode/checkpoint-60000-2.9975
LOGFILE=text2code_concode_eval_1.log
python -u run.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=512 \
--do_eval \
--logging_steps=100 \
--seed=42
There is an error:
Traceback (most recent call last):
File "run.py", line 653, in <module>
main()
File "run.py", line 644, in main
dev_bleu, dev_EM = eval_bleu(args, model, tokenizer, file_type='dev', num=2000)
File "run.py", line 357, in eval_bleu
for step, (batch, token_labels) in enumerate(test_dataloader):
File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/linjiayi/anaconda3/envs/Deepcs/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/linjiayi/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType
It is very strange. In fact, I used the same script as above when I evaluated with the model you provided, only with a different PRETRAINDIR parameter. Evaluation works with the pre-trained model you provide.
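The RuntimeError above is what torch.tensor raises when it is handed None, so for some item at least one of self.inputs[item] or self.token_labels[item] was None. Which of the two, and why, is not stated in the thread. A minimal diagnostic sketch, using plain lists in place of the dataset's fields (the helper name `find_none_items` and the sample data are hypothetical):

```python
from typing import List, Optional, Sequence, Tuple


def find_none_items(
    inputs: Sequence[Optional[list]],
    token_labels: Sequence[Optional[list]],
) -> List[Tuple[int, str]]:
    """Return (index, field name) for every entry that is None."""
    problems = []
    for idx, (inp, lab) in enumerate(zip(inputs, token_labels)):
        if inp is None:
            problems.append((idx, "inputs"))
        if lab is None:
            problems.append((idx, "token_labels"))
    return problems


# Made-up data: the second example has no labels.
inputs = [[1, 2, 3], [4, 5, 6]]
token_labels = [[0, 1, 1], None]
print(find_none_items(inputs, token_labels))
```

Running such a check over the dataset's two lists right after it is built would show which field is missing before the DataLoader is iterated.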
When I pre-train CodeGPT from scratch, I leave the parameter args.pretrained_dir at its default, as you suggested. I run:
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/save/pretrain_GPT2
LOGFILE=CodeSearchNet_pretrain_GPT2_a100_8gpu.log
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm_pretrain.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=1024 \
--do_train \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=4e-4 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=8 \
--per_gpu_eval_batch_size=8 \
--gradient_accumulation_steps=4 \
--num_train_epochs=20 \
--logging_steps=100 \
--save_steps=1000 \
--overwrite_output_dir \
--seed=42
There is an error:
Start slurm job at Wed 20 Oct 2021 02:13:13 PM CST
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
Please read local_rank from `os.environ('LOCAL_RANK')` instead.
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : run_lm_pretrain.py
min_nodes : 1
max_nodes : 1
nproc_per_node : 8
run_id : none
rdzv_backend : static
rdzv_endpoint : 127.0.0.1:29500
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:53: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_0/7/error.json
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -8, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -3, device: cuda:5, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -5, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -2, device: cuda:6, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 5 using best-guess GPU 5 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 6 using best-guess GPU 6 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -6, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -7, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -4, device: cuda:4, n_gpu: 1, distributed training: True, 16-bits training: False, world size: 8
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 4 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
10/20/2021 14:13:32 - WARNING - __main__ - Process rank: -1, device: cuda:7, n_gpu: 1, distributed training: False, 16-bits training: False, world size: 1
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 697, in main
tokenizer = tokenizer_class.from_pretrained(args.tokenizer_dir, sep_token='<EOL>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', unk_token='<|UNKNOWN|>')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 1648, in from_pretrained
pretrained_model_name_or_path, revision=revision, use_auth_token=use_auth_token
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 3408, in get_fast_tokenizer_file
all_files = get_list_of_files(path_or_repo, revision=revision, use_auth_token=use_auth_token)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/transformers/file_utils.py", line 1687, in get_list_of_files
path_or_repo, revision=revision, token=token
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/huggingface_hub/hf_api.py", line 248, in model_info
r.raise_for_status()
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/requests/models.py", line 953, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 7 (pid: 1383506) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_1/7/error.json
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1383747) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=2
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_2/7/error.json
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
Traceback (most recent call last):
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
Traceback (most recent call last):
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=24, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1411598) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=3
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
global_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
role_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
global_world_sizes=[8, 8, 8, 8, 8, 8, 8, 8]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/3/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker4 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/4/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker5 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/5/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker6 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/6/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker7 reply file to: /tmp/torchelastic_gjc2y4mz/none_wycu7f9g/attempt_3/7/error.json
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 3, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 4, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "run_lm_pretrain.py", line 731, in <module>
File "run_lm_pretrain.py", line 731, in <module>
File "run_lm_pretrain.py", line 731, in <module>
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
main()
File "run_lm_pretrain.py", line 629, in main
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
torch.distributed.init_process_group(backend='nccl')
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
_store_based_barrier(rank, store, timeout)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 222, in _store_based_barrier
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 2, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
rank, store_key, world_size, worker_count, timeout))
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=8, worker_count=32, timeout=0:30:00)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1439177) of binary: /home/linjiayi/anaconda3/envs/python36/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/utils/store.py:71: FutureWarning: This is an experimental API and will be changed in future.
"This is an experimental API and will be changed in future.", FutureWarning
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0003800392150878906 seconds
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 0, "group_rank": 0, "worker_id": "1439177", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [0], \"role_rank\": [0], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 1, "group_rank": 0, "worker_id": "1439178", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [1], \"role_rank\": [1], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 2, "group_rank": 0, "worker_id": "1439179", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [2], \"role_rank\": [2], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 3, "group_rank": 0, "worker_id": "1439181", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [3], \"role_rank\": [3], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 4, "group_rank": 0, "worker_id": "1439185", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [4], \"role_rank\": [4], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 5, "group_rank": 0, "worker_id": "1439187", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [5], \"role_rank\": [5], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 6, "group_rank": 0, "worker_id": "1439190", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [6], \"role_rank\": [6], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.FAILED", "source": "WORKER", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": 7, "group_rank": 0, "worker_id": "1439192", "role": "default", "hostname": "dgx018.scc.idea", "state": "FAILED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": "{\"message\": \"<NONE>\"}", "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\", \"local_rank\": [7], \"role_rank\": [7], \"role_world_size\": [8]}", "agent_restarts": 3}}
{"name": "torchelastic.worker.status.SUCCEEDED", "source": "AGENT", "timestamp": 0, "metadata": {"run_id": "none", "global_rank": null, "group_rank": 0, "worker_id": null, "role": "default", "hostname": "dgx018.scc.idea", "state": "SUCCEEDED", "total_run_time": 5448, "rdzv_backend": "static", "raw_error": null, "metadata": "{\"group_world_size\": 1, \"entry_point\": \"python\"}", "agent_restarts": 3}}
/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:354: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1439177 (local_rank 0) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
# do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 173, in <module>
main()
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launch.py", line 169, in main
run(args)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/run.py", line 624, in run
)(*cmd_args)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/linjiayi/anaconda3/envs/python36/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
run_lm_pretrain.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-10-20_15:44:16
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 1439177)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
[1]:
time: 2021-10-20_15:44:16
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 1439178)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[2]:
time: 2021-10-20_15:44:16
rank: 2 (local_rank: 2)
exitcode: 1 (pid: 1439179)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[3]:
time: 2021-10-20_15:44:16
rank: 3 (local_rank: 3)
exitcode: 1 (pid: 1439181)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[4]:
time: 2021-10-20_15:44:16
rank: 4 (local_rank: 4)
exitcode: 1 (pid: 1439185)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[5]:
time: 2021-10-20_15:44:16
rank: 5 (local_rank: 5)
exitcode: 1 (pid: 1439187)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[6]:
time: 2021-10-20_15:44:16
rank: 6 (local_rank: 6)
exitcode: 1 (pid: 1439190)
error_file: <N/A>
msg: "Process failed with exitcode 1"
[7]:
time: 2021-10-20_15:44:16
rank: 7 (local_rank: 7)
exitcode: 1 (pid: 1439192)
error_file: <N/A>
msg: "Process failed with exitcode 1"
***************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
End slurm job at Wed Oct 20 15:44:16 CST 2021
The errors seem to be `requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None` and `RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)`.
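To make the barrier timeout easier to read, the sketch below pulls `world_size` and `worker_count` out of the error message and reports how many more workers registered than expected. The function name `parse_barrier_timeout` and the `surplus` field are hypothetical. Reading a surplus as leftover registrations from earlier launch attempts on the same store is an assumption, not something confirmed in this thread.

```python
import re

# Matches the tail of the "store based barrier" timeout message from the logs.
BARRIER_RE = re.compile(
    r"rank: (\d+), for key: \S+ "
    r"\(world_size=(\d+), worker_count=(\d+), timeout=([\d:]+)\)"
)


def parse_barrier_timeout(message):
    """Extract rank, world_size and worker_count from a barrier timeout message.

    Returns None when the message does not look like a barrier timeout.
    """
    match = BARRIER_RE.search(message)
    if match is None:
        return None
    rank, world_size, worker_count = (int(g) for g in match.groups()[:3])
    return {
        "rank": rank,
        "world_size": world_size,
        "worker_count": worker_count,
        # Assumption: a positive surplus hints at stale registrations.
        "surplus": worker_count - world_size,
    }


log_line = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 3, for key: store_based_barrier_key:1 "
    "(world_size=8, worker_count=32, timeout=0:30:00)"
)
info = parse_barrier_timeout(log_line)
print(info)
```

In the log above, `worker_count` is a multiple of `world_size`, which is at least consistent with several attempts of 8 workers each having registered on one store.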
I pre-trained `adaptedGPT2` with the same script, only setting `args.pretrain_dir` to `gpt2`:
LANG=java
DATADIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/dataset/CodeSearchNet/java
OUTPUTDIR=/platform_tech/linjiayi/CodeXGLUE/Text-Code/text-to-code/save/pretrain_adaptedGPT2
PRETRAINDIR=gpt2
LOGFILE=CodeSearchNet_pretrain_adaptedGPT2_a100_8gpu.log
PER_NODE_GPU=8
python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm_pretrain.py \
--data_dir=$DATADIR \
--langs=$LANG \
--output_dir=$OUTPUTDIR \
--pretrain_dir=$PRETRAINDIR \
--log_file=$LOGFILE \
--model_type=gpt2 \
--block_size=1024 \
--do_train \
--gpu_per_node $PER_NODE_GPU \
--learning_rate=4e-4 \
--weight_decay=0.01 \
--evaluate_during_training \
--per_gpu_train_batch_size=8 \
--per_gpu_eval_batch_size=8 \
--gradient_accumulation_steps=4 \
--num_train_epochs=20 \
--logging_steps=100 \
--save_steps=1000 \
--overwrite_output_dir \
--seed=42
When I pre-trained `adaptedGPT2`, I did not encounter the process group initialization failure.
It is very strange. In fact, I used the same script as above when I evaluated the model you provided, only with a different PRETRAINDIR value. Evaluation works with the pre-trained model you provide.
For this error, one possible reason is a different tokenizer. For example, if your tokenizer doesn't have a special token named `bos_token`, then calling `tokenizer.bos_token_id` will return `None`. The tokenizers used for pre-training and for fine-tuning on CONCODE are different, so you may want to double-check whether you added the special tokens when fine-tuning.
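The check suggested above can be sketched in plain Python. The helper name `find_missing_special_tokens` and the example ids are hypothetical. In a real run the dictionary would be filled from the tokenizer's own attributes (for example `tokenizer.bos_token_id`) just before the ids are turned into tensors.

```python
from typing import Dict, List, Optional


def find_missing_special_tokens(token_ids: Dict[str, Optional[int]]) -> List[str]:
    """Return the names of special tokens whose id is None."""
    return [name for name, token_id in token_ids.items() if token_id is None]


# Hypothetical values: a tokenizer where the separator token was never registered.
ids = {"bos_token_id": 50257, "eos_token_id": 50258, "sep_token_id": None}

missing = find_missing_special_tokens(ids)
if missing:
    print("Special tokens without an id:", missing)
```

Running such a check before evaluation points at the unregistered token directly, rather than surfacing later as a dtype error when a `None` id reaches tensor construction.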
The errors seem to be `requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/api/models/None` and `RuntimeError: Timed out initializing process group in store based barrier on rank: 1, for key: store_based_barrier_key:1 (world_size=8, worker_count=16, timeout=0:30:00)`.
To pre-train `CodeGPT` from scratch, you need to set `args.tokenizer_dir` and `args.config_dir` to `microsoft/CodeGPT-xx` so that the tokenizer can be loaded in this code. I also notice that you didn't set the `node_index` param. When running on a single machine, please set `node_index` to 0; otherwise the `local_rank` will be negative.
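A minimal sketch of why a missing `node_index` can produce negative ranks. It assumes the script offsets each process's `local_rank` by `node_index * gpu_per_node` and that `node_index` defaults to -1. Both are assumptions about `run_lm_pretrain.py`, and the function name `global_rank` is hypothetical.

```python
def global_rank(local_rank: int, node_index: int, gpu_per_node: int) -> int:
    """Rank across all nodes, assuming ranks are laid out node by node."""
    return local_rank + node_index * gpu_per_node


GPU_PER_NODE = 8

# Assumed default when --node_index is not passed.
unset = [global_rank(r, -1, GPU_PER_NODE) for r in range(GPU_PER_NODE)]
# Single machine with --node_index=0.
single = [global_rank(r, 0, GPU_PER_NODE) for r in range(GPU_PER_NODE)]

print("node_index unset:", unset)
print("node_index=0:   ", single)
```

Under these assumptions, only the `node_index=0` case yields the ranks 0 to 7 that an 8-GPU single-node job expects.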
@celbree Thanks for your reply!
Considering that the tokenizer used for pre-training and fine-tune on CONCODE are different.
I have finished pre-training and fine-tuning. Errors are reported only during evaluation. Pre-training result:
Fine-tuning result:
An evaluation error was reported on the dev set:
RuntimeError: Could not infer dtype of NoneType
Why are `args.tokenizer_dir` and `args.config_dir` set to `microsoft/CodeGPT-small-java` instead of `gpt2`? Is it because the vocabulary of `gpt2` is built from text, while the vocabulary of `microsoft/CodeGPT-small-java` is built from code?
I looked it up on the Internet:
from transformers import GPT2Tokenizer, GPT2Model

# OpenAI GPT-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
For pre-training CodeGPT from scratch, are the following hyper-parameters the same as in the adaptedGPT2 pre-training script, or as in the fine-tuning scripts: number of epochs, learning rate, per_gpu_train_batch_size, weight decay, gradient accumulation steps, optimizer?
The tokenizer is almost the same, with a slight difference: we add several special tokens such as `concode_elem_sep`, which makes the vocab size larger for fine-tuning than for pre-training. About the NoneType error, it might be because the tokenizer returns `None` when you request a non-existent special token, as I said before. Could you please print the token ids when the error happens?
Because `CodeGPT` and `CodeGPT-adapted` use totally different tokenizers. We trained a new BPE tokenizer on the code domain for pre-training `CodeGPT`, but for `CodeGPT-adapted` we just reuse the tokenizer from OpenAI GPT-2. So if you want to pre-train `CodeGPT` from scratch, you can either use our trained BPE tokenizer by setting `microsoft/CodeGPT-small-xx` or train a new tokenizer yourself.
We increase the number of epochs and keep other hyper-parameters the same.
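For readers unfamiliar with what "training a BPE tokenizer" involves, here is a toy sketch of the merge-learning loop at its core. It is a character-level illustration on whitespace-split words, not the byte-level tokenizer used for CodeGPT. The function names and the tiny corpus are hypothetical, and a real tokenizer would normally be trained with a dedicated library.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def count_pairs(words: Dict[Word, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], words: Dict[Word, int]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merges by repeatedly merging the most frequent pair."""
    words: Dict[Word, int] = dict(
        Counter(tuple(word) for line in corpus for word in line.split())
    )
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single symbol
        # Break frequency ties deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(best, words)
    return merges


merges = learn_bpe(["int int int", "in"], num_merges=5)
print(merges)
```

The learned merges depend entirely on the training corpus, so a tokenizer trained on one programming language may split another language's code into many more tokens.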
Thanks for your reply!
After manually adding `"concode_elem_sep": 50261` to `added_tokens.json`, I solved the `Could not infer dtype of NoneType` problem. I have a few questions:
After fine-tuning the `microsoft/CodeGPT-small-java-adaptedGPT2` model you provided, the resulting `added_tokens.json` is as follows:
{"<pad>": 50257, "</s>": 50258, "<s>": 50259, "concode_elem_sep": 50260}
After fine-tuning my own pre-trained `adaptedGPT2` model, the resulting `added_tokens.json` is as follows:
{"<s>": 50257, "</s>": 50258, "<EOL>": 50259, "<pad>": 50260}
The special token `concode_elem_sep` is missing from my result, even though I set `PRETRAINDIR` to a checkpoint generated by pre-training:
PRETRAINDIR=/platform_tech/linjiayi/CodeXGLUE/Code-Code/CodeCompletion-token/save/pretrain_adaptedGPT2/checkpoint-last
I also added `concode_elem_sep` to the tokenizer as a special token:
https://github.com/microsoft/CodeXGLUE/blob/ae1d06f5505b3f71b6e1be36ee26028f17c09994/Text-Code/text-to-code/code/run.py#L613
I ran the following code:
if args.do_eval:  # only works on 1 GPU
    print(tokenizer.bos_token)
    print(tokenizer.sep_token)
    print(tokenizer.encode('<s>'))
    print(tokenizer.encode('concode_elem_sep', add_special_tokens=True))
The result is:
<s>
concode_elem_sep
[50257]
[None]
Why is there no `concode_elem_sep` in the `added_tokens.json` file generated after fine-tuning my own pre-trained model? Is it appropriate to manually add `concode_elem_sep` to `added_tokens.json` as I described above?
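One way to see how the two `added_tokens.json` files differ is to compare them programmatically. The sketch below uses the two mappings quoted above; the function name `compare_added_tokens` and the report keys are hypothetical.

```python
import json

# The two added_tokens.json contents quoted in this thread.
provided = json.loads(
    '{"<pad>": 50257, "</s>": 50258, "<s>": 50259, "concode_elem_sep": 50260}'
)
mine = json.loads(
    '{"<s>": 50257, "</s>": 50258, "<EOL>": 50259, "<pad>": 50260}'
)


def compare_added_tokens(reference, candidate):
    """Report tokens that are missing, extra, or mapped to a different id."""
    shared = set(reference) & set(candidate)
    return {
        "missing": sorted(set(reference) - set(candidate)),
        "extra": sorted(set(candidate) - set(reference)),
        "remapped": sorted(t for t in shared if reference[t] != candidate[t]),
        # First id not yet used by the candidate's added tokens.
        "next_free_id": max(candidate.values()) + 1,
    }


report = compare_added_tokens(provided, mine)
print(json.dumps(report, indent=2))
```

The comparison shows that `<pad>` and `<s>` also map to different ids in the two files, so ids are not interchangeable between the two checkpoints. Whether the model's embedding matrix has a row for a hand-added id is a separate question that the JSON file alone cannot answer.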
We train a new BPE tokenizer on code domain for pre-training CodeGPT. So if you want to pre-train CodeGPT from scratch, you can either use our trained BPE tokenizer by setting microsoft/CodeGPT-small-xx or train a new tokenizer by yourself.
I downloaded `vocab.json` from `microsoft/CodeGPT-small-java` and found that `<s>`, `<pad>` and `</s>` are already stored in it:
"<s>": 0,
"<pad>": 1,
"</s>": 2,
Why do I still need to add these special tokens when I'm pre-training from scratch? https://github.com/microsoft/CodeXGLUE/blob/ae1d06f5505b3f71b6e1be36ee26028f17c09994/Text-Code/text-to-code/code/run.py#L613
How do I train a new BPE tokenizer on the code domain? If I pre-train on different programming languages, do I need to retrain the tokenizer?
I pre-trained `adaptedGPT2` and got the following evaluation results:
Model | EM | BLEU | CodeBLEU |
---|---|---|---|
CodeGPT(yours) | 18.25 | 28.69 | 32.71 |
CodeGPT-adapted(yours) | 20.10 | 32.79 | 35.98 |
CodeGPT-adapted(I pretrained) | 5.2 | 32.45 | 35.52 |
BLEU and CodeBLEU are similar to the results you provided. Why is EM so different from yours?
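As background on how these metrics can diverge, the sketch below contrasts an all-or-nothing exact-match score with a token-overlap score. The overlap function is a crude stand-in, not BLEU or CodeBLEU. Comparing whitespace-split tokens for exact match is an assumption about the evaluator, and the example predictions are made up. It illustrates the mechanism only and does not diagnose the numbers in the table.

```python
from collections import Counter
from typing import List


def exact_match(preds: List[str], refs: List[str]) -> float:
    """Percentage of predictions whose token sequence equals the reference."""
    hits = sum(p.split() == r.split() for p, r in zip(preds, refs))
    return 100.0 * hits / len(refs)


def token_overlap(preds: List[str], refs: List[str]) -> float:
    """Percentage of predicted tokens that also occur in the reference."""
    matched = total = 0
    for p, r in zip(preds, refs):
        pred_counts, ref_counts = Counter(p.split()), Counter(r.split())
        matched += sum((pred_counts & ref_counts).values())
        total += sum(pred_counts.values())
    return 100.0 * matched / total


# Made-up examples: the second prediction differs by a single token.
refs = [
    "int function ( ) { return count ; }",
    "void function ( ) { count = 0 ; }",
]
preds = [
    "int function ( ) { return count ; }",
    "void function ( ) { count = 1 ; }",
]

print("EM:", exact_match(preds, refs))
print("overlap:", round(token_overlap(preds, refs), 2))
```

A single differing token zeroes out exact match for that example while barely moving an overlap-based score, so inspecting a few prediction/reference pairs side by side is a reasonable first step when EM alone is low.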
void function ( ) { TSTNode loc0 = root ; while ( loc0 != null ) { if ( loc0 . data . equals ( data ) ) return ; loc0 = loc0 . left ; } }
The generated code is not readable. Is it because of the data format of the CONCODE fine-tuning dataset? If I want to generate an executable and more readable code snippet like the following, what should I do?
It is very strange. I pre-trained `CodeGPT-adapted` and `CodeGPT`. Every time I fine-tune them, I get different evaluation results, even though I use the same parameters each time.
Hi, I want to re-pretrain the CodeGPT and CodeGPT-adapted models for code generation. How were these models pre-trained? Do you plan to share the pre-training code?