Comment by iftenney Friday Apr 03, 2020 at 14:35 GMT
@pyeres @pruksmhc maybe this got renamed recently?
In the meantime, unless you're trying to probe GPT-1, you can just comment out this line: https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
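For reference, a sketch of that workaround (the exact invocation on the linked line is an assumption; only the `OpenAI.BPE` tokenizer string is confirmed later in this thread):

```sh
# Workaround: prefix the GPT retokenization line with '#' to skip it.
# (Hypothetical sketch -- the real arguments on line 35 may differ; only the
# "OpenAI.BPE" tokenizer string is confirmed in this thread.)
# python retokenize_edge_data.py -t "OpenAI.BPE" "$JIANT_DATA_DIR/edges/"*/*.json
```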
Comment by pyeres Friday Apr 03, 2020 at 15:07 GMT
Looks like this is the result of PR https://github.com/nyu-mll/jiant/pull/881 "Replacing old GPT implement with the one from huggingface pytorch transformers". @HaokunLiu, can you take a look?
Comment by HaokunLiu Friday Apr 03, 2020 at 15:10 GMT
Okay. You can either choose `auto` as your tokenizer or simply use the same string as your `input_module`.
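A minimal sketch of what that looks like for the main jiant program, assuming the usual `--overrides` mechanism (the config file path is a placeholder; only `input_module` and `tokenizer` are the parameters named above):

```sh
# Main-program sketch (not the preprocessing script): pick the tokenizer via
# overrides. The config path is a placeholder for your own experiment config.
python main.py --config_file jiant/config/demo.conf \
    --overrides "input_module = openai-gpt, tokenizer = auto"
```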
Comment by lovodkin93 Friday Apr 03, 2020 at 16:51 GMT
> Okay. You can either choose `auto` as your tokenizer or simply use the same string as your `input_module`.

I didn't quite follow you. Where should I choose the `auto` tokenizer? Also, what is the `auto` tokenizer? And what do you mean by using the same string as my `input_module`? @HaokunLiu
Comment by HaokunLiu Friday Apr 03, 2020 at 16:55 GMT
Oh, sorry. I thought this was about the main program. For this preprocessing script, when you are using GPT, choose `openai-gpt` as your tokenizer; when you are using GPT-2, choose `gpt2-medium`.
Comment by lovodkin93 Friday Apr 03, 2020 at 16:58 GMT
> Oh, sorry. I thought this was about the main program. For this preprocessing script, when you are using GPT, choose `openai-gpt` as your tokenizer; when you are using GPT-2, choose `gpt2-medium`.

You mean replace the `"OpenAI.BPE"` in the following line with `"openai-gpt"` or with `"gpt2-medium"`? https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35 @HaokunLiu
Comment by HaokunLiu Friday Apr 03, 2020 at 17:33 GMT
Exactly
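For anyone hitting this later, one way to apply that edit from the repo root (a sketch; GNU `sed` assumed):

```sh
# Swap the unsupported tokenizer name on the linked line (GNU sed; on macOS
# use `sed -i ''`). Use gpt2-medium instead if you are probing GPT-2.
sed -i 's/OpenAI\.BPE/openai-gpt/' probing/get_and_process_all_data.sh
```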
Issue by lovodkin93 Thursday Apr 02, 2020 at 15:05 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/1049
Hello, I've been trying to preprocess the data as described in the README file located in the probing directory. I ran the following command (taken from that README):

```sh
mkdir -p $JIANT_DATA_DIR
./get_and_process_all_data.sh $JIANT_DATA_DIR
```
and got the following error message:

```
Traceback (most recent call last):
  File "./retokenize_edge_data.py", line 97, in <module>
    main(sys.argv[1:])
  File "./retokenize_edge_data.py", line 93, in main
    retokenize_file(fname, args.tokenizer_name, worker_pool=worker_pool)
  File "./retokenize_edge_data.py", line 83, in retokenize_file
    for line in tqdm(worker_pool.imap(map_fn, inputs, chunksize=500), total=len(inputs)):
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
ValueError: Unsupported tokenizer 'OpenAI.BPE'
```
I tried to download the "openai gpt-2" model, which I saw includes the tokenizer mentioned, but it appears to require Python 3.7, while the jiant environment runs on an older version. Has anyone seen this error before, or does anyone know how to solve it? @iftenney