nyu-mll / jiant-v1-legacy

The jiant toolkit for general-purpose text understanding models
MIT License

Unsupported tokenizer 'OpenAI.BPE' #1049

Open jeswan opened 4 years ago

jeswan commented 4 years ago

Issue by lovodkin93 Thursday Apr 02, 2020 at 15:05 GMT Originally opened as https://github.com/nyu-mll/jiant/issues/1049


Hello, I've been trying to pre-process the data as described in the README file located in the probing directory. I ran the following commands (taken from the README mentioned above):

```sh
mkdir -p $JIANT_DATA_DIR
./get_and_process_all_data.sh $JIANT_DATA_DIR
```

and got the following error message:

```
Traceback (most recent call last):
  File "./retokenize_edge_data.py", line 97, in <module>
    main(sys.argv[1:])
  File "./retokenize_edge_data.py", line 93, in main
    retokenize_file(fname, args.tokenizer_name, worker_pool=worker_pool)
  File "./retokenize_edge_data.py", line 83, in retokenize_file
    for line in tqdm(worker_pool.imap(map_fn, inputs, chunksize=500), total=len(inputs)):
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1022, in __iter__
    for obj in iterable:
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 320, in <genexpr>
    return (item for chunk in result for item in chunk)
  File "/cs/labs/oabend/lovodkin93/anaconda3/envs/jiant/lib/python3.6/multiprocessing/pool.py", line 735, in next
    raise value
ValueError: Unsupported tokenizer 'OpenAI.BPE'
```

I tried to download the "openai gpt-2" model, which I saw includes the tokenizer mentioned, but it appears to require Python 3.7, while the jiant environment runs an older version. Has anyone seen this error before, or does anyone know how to solve it? @iftenney

jeswan commented 4 years ago

Comment by iftenney Friday Apr 03, 2020 at 14:35 GMT


@pyeres @pruksmhc maybe this got renamed recently?

In the meantime, unless you're trying to probe GPT-1, you can just comment out this line: https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35
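The workaround can be sketched as an in-place `sed` edit. Note this is a sketch only: the stand-in script contents below are an assumption based on the error message, and the real line 35 of `get_and_process_all_data.sh` may look different.

```shell
# Stand-in excerpt for probing/get_and_process_all_data.sh
# (the real line 35 may differ).
cat > process_excerpt.sh <<'EOF'
python $HERE/retokenize_edge_data.py -t "MosesTokenizer" $ALL_DIRS
python $HERE/retokenize_edge_data.py -t "OpenAI.BPE" $ALL_DIRS
EOF

# Prefix any line mentioning OpenAI.BPE with '#' so it is skipped.
sed -i 's/^.*OpenAI\.BPE.*$/# &/' process_excerpt.sh
```

(On macOS, `sed -i` needs an explicit backup suffix, e.g. `sed -i ''`.)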

jeswan commented 4 years ago

Comment by pyeres Friday Apr 03, 2020 at 15:07 GMT


Looks like this is the result of PR https://github.com/nyu-mll/jiant/pull/881 "Replacing old GPT implement with the one from huggingface pytorch transformers". @HaokunLiu, can you take a look?

jeswan commented 4 years ago

Comment by HaokunLiu Friday Apr 03, 2020 at 15:10 GMT


Okay. You can either choose `auto` as your tokenizer, or simply use the same string as your input_module.


jeswan commented 4 years ago

Comment by lovodkin93 Friday Apr 03, 2020 at 16:51 GMT


> Okay. You can either choose auto as your tokenizer or simply use the same string as your input_module

I didn't quite follow you - where should I choose the auto tokenizer? Also, what is the auto-tokenizer? And what do you mean by using the same string as my input_module? @HaokunLiu

jeswan commented 4 years ago

Comment by HaokunLiu Friday Apr 03, 2020 at 16:55 GMT


Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using GPT, choose openai-gpt as your tokenizer; when you are using GPT-2, choose gpt2-medium.
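The naming rule above can be sketched as a small shell helper. This is illustrative only: the `pick_tokenizer` function is hypothetical (not part of jiant), and the name mapping is taken from this discussion.

```shell
# Hypothetical helper: map an input_module string to the tokenizer
# name to pass to retokenize_edge_data.py.
pick_tokenizer() {
  case "$1" in
    OpenAI.BPE|openai-gpt) echo "openai-gpt" ;;   # GPT-1 (old alias removed)
    gpt2*)                 echo "gpt2-medium" ;;  # GPT-2 family
    *)                     echo "$1" ;;           # otherwise, pass through
  esac
}

pick_tokenizer "OpenAI.BPE"   # prints: openai-gpt
pick_tokenizer "gpt2"         # prints: gpt2-medium
```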


jeswan commented 4 years ago

Comment by lovodkin93 Friday Apr 03, 2020 at 16:58 GMT


> Oh, sorry. I thought it was the main program. For this preprocessing program, when you are using gpt, choose openai-gpt as your tokenizer, when you are using gpt2, choose gpt2-medium.

You mean replace the "OpenAI.BPE" in the following line with "openai-gpt" or with "gpt2-medium"? https://github.com/nyu-mll/jiant/blob/master/probing/get_and_process_all_data.sh#L35 @HaokunLiu

jeswan commented 4 years ago

Comment by HaokunLiu Friday Apr 03, 2020 at 17:33 GMT


Exactly.
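That substitution can be sketched with `sed`. As before, the stand-in script line below is an assumption; substitute `"gpt2-medium"` instead if you are probing GPT-2.

```shell
# Stand-in for line 35 of probing/get_and_process_all_data.sh.
cat > process_fix_excerpt.sh <<'EOF'
python $HERE/retokenize_edge_data.py -t "OpenAI.BPE" $ALL_DIRS
EOF

# Replace the removed "OpenAI.BPE" alias with the HuggingFace
# model string ("openai-gpt" for GPT-1).
sed -i 's/"OpenAI\.BPE"/"openai-gpt"/' process_fix_excerpt.sh
cat process_fix_excerpt.sh
```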
