lin-tan / CURE

Artefact for our ICSE 2021 paper CURE: Code-Aware Neural Machine Translation for Automatic Program Repair

beamsearch.py script is broken #11

Open msintaha opened 1 year ago

msintaha commented 1 year ago

Hi @jiang719 @lin-tan

We have been able to train the model, but the inference step fails. When we run src/tester/generator.py to generate the hypotheses, we keep getting the same error in beamsearch.py, in both CPU and GPU mode:

/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py:1960: UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Traceback (most recent call last):
  File "src/tester/generator.orig.py", line 134, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "src/tester/generator.py", line 89, in generate_gpt_conut
    generator.generate(output_file)
  File "src/tester/generator.py", line 39, in generate
    hypothesis = self.beamsearch.generate_gpt_conut(sample)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 570, in generate_gpt_conut
    logits = self.model.decode(
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/beamsearch.py", line 114, in decode
    logits = self.model.decoder(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/scratch/st-amesbah-1/cure-debug/cure/src/tester/../models/gpt_conut.py", line 313, in forward
    embed = share_embed_model.transformer(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nnashid/.local/lib/python3.8/site-packages/transformers/modeling_openai.py", line 429, in forward
    inputs_embeds = self.tokens_embed(input_ids)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/conda-envs/cure-debug-env/lib/python3.8/site-packages/torch/nn/functional.py", line 2199, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
nashid commented 1 year ago

@jiang719 we are stuck on this problem. We trained the GPT-CoNuT model and ran inference, but we keep getting the above error. We would really appreciate your insight into it.

jiang719 commented 1 year ago
  1. This is likely due to a problem with the vocabulary. Are you using the pre-trained GPT model I shared when training your own GPT-CoNuT model?

  2. It could also be a problem with the data format. Make sure you follow the three steps in CURE/data/data/prepare_testing_data.py to prepare the test data in the required format.

nashid commented 1 year ago

We have trained GPT-CoNuT with our dataset. My colleague @msintaha already looked into the steps for dataset creation to ensure we are following the same format, but we will cross-check again on our side.

@jiang719 thanks for your feedback, we really appreciate it.

jiang719 commented 1 year ago

To clarify the first possible cause: when you trained your own GPT-CoNuT model, did you only change the train_file and valid_file in src/trainer/gpt_conut_trainer.py and keep the vocab_file and gpt_file unchanged? If that's the case, the model should be fine and the problem is more likely to be in the test data.
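For reference, a minimal sketch of what those assignments in src/trainer/gpt_conut_trainer.py might look like; the paths below are illustrative placeholders, not the repository's actual values:

    # Hypothetical sketch: only train_file and valid_file should point to your own data;
    # vocab_file and gpt_file stay on the released artefacts.
    vocab_file = 'data/vocabulary/vocabulary.txt'   # keep unchanged
    gpt_file = 'data/models/code_gpt.pt'            # keep unchanged (pre-trained GPT weights)
    train_file = 'my_data/training_bpe.txt'         # changed: your own training data
    valid_file = 'my_data/validation_bpe.txt'       # changed: your own validation data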

Could you share one test instance from the input_file used in src/tester/generator.py, along with its corresponding lines in the identifier_txt_file and identifier_token_file, so I can check whether they look correct?

msintaha commented 1 year ago

Yes, here you go

input_bpe.txt

app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; <CTX> var _ = require ( $STRING$ ) ; var express = require ( $STRING$ ) ; var app = express ( ) ; var http = require ( $STRING$ ) . Server ( app ) ; var path = require ( $STRING$ ) ; var io = require ( $STRING$ ) ( http ) ; const PORT = process . env . PORT || $NUMBER$ ; var users = [ ] ; app . use ( express . static ( path . join ( _ _ dirname + $STRING$ ) ) ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( users ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { res . send ( JSON . stringify ( _ . find ( users , ( user ) = > user . id == == req . query . user CaMeL Id ) ) ) ; } ) ; app . get ( $STRING$ , ( req , res ) = > { console . log ( $STRING$ ) ; res . send CaMeL File ( path . join ( _ _ dirname + $STRING$ ) ) ; } ) ; io . on ( $STRING$ , ( socket ) = > { console . log ( ` a user connected : $ { socket . id } ` ) ; socket . on ( $STRING$ , ( player ) = > { users . push ( player ) ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; socket . on ( $STRING$ , ( payload ) = > { var user = _ . find ( users , ( user ) = > user . id == == payload . user CaMeL Id ) ; user . life = payload . life ; socket . broadcast . emit ( $STRING$ , users ) ; } ) ; } ) ; http . listen ( PORT , ( ) = > { console . log ( $STRING$ ) ; } ) ;@@  

identifier.txt

send if throw 1 return post code http ++ express , Server exports var function router Router static ] msg err extends $NUMBER$ PORT from __dirname ) getElementById > find obj _ 0xffffffff catch io async class type content get JSON options continue document 0x7f push switch || env id use break ! res + body [ listen user connect ; result else PropTypes error for T const 0 typeof sendFile - app Route undefined key payload import on } React req ( connection < while console join e false broadcast i life value users emit length host 0x1f style name set state message do = $STRING$ action userId log await of data in url => socket <<unk>> true module path config node stringify process done new axios query { . player require === :

identifier.tokens

send <SEP> if <SEP> throw <SEP> 1 <SEP> return <SEP> post <SEP> code <SEP> http <SEP> ++ <SEP> express <SEP> , <SEP> Server <SEP> exports <SEP> var <SEP> function <SEP> router <SEP> Router <SEP> static <SEP> ] <SEP> msg <SEP> err <SEP> extends <SEP> $NUMBER$ <SEP> PORT <SEP> from <SEP> _ _ dirname <SEP> ) <SEP> get CaMeL Element CaMeL By CaMeL Id <SEP> > <SEP> find <SEP> obj <SEP> _ <SEP> 0 xffffffff <SEP> catch <SEP> io <SEP> async <SEP> class <SEP> type <SEP> content <SEP> get <SEP> JSON <SEP> options <SEP> continue <SEP> document <SEP> 0 x $NUMBER$ f <SEP> push <SEP> switch <SEP> || <SEP> env <SEP> id <SEP> use <SEP> break <SEP> ! <SEP> res <SEP> + <SEP> body <SEP> [ <SEP> listen <SEP> user <SEP> connect <SEP> ; <SEP> result <SEP> else <SEP> Prop CaMeL Types <SEP> error <SEP> for <SEP> T <SEP> const <SEP> 0 <SEP> typeof <SEP> send CaMeL File <SEP> - <SEP> app <SEP> Route <SEP> undefined <SEP> key <SEP> payload <SEP> import <SEP> on <SEP> } <SEP> React <SEP> req <SEP> ( <SEP> connection <SEP> < <SEP> while <SEP> console <SEP> join <SEP> e <SEP> false <SEP> broadcast <SEP> i <SEP> life <SEP> value <SEP> users <SEP> emit <SEP> length <SEP> host <SEP> 0 x 1 f <SEP> style <SEP> name <SEP> set <SEP> state <SEP> message <SEP> do <SEP> = <SEP> $STRING$ <SEP> action <SEP> user CaMeL Id <SEP> log <SEP> await <SEP> of <SEP> data <SEP> in <SEP> url <SEP> = > <SEP> socket <SEP> <<unk>> <SEP> true <SEP> module <SEP> path <SEP> config <SEP> node <SEP> stringify <SEP> process <SEP> done <SEP> new <SEP> axios <SEP> query <SEP> { <SEP> . <SEP> player <SEP> require <SEP> == == = <SEP> :
jiang719 commented 1 year ago

@msintaha Looks like you only ran the prepare_cure_input function.

There are two remaining steps:

  1. Run subword-nmt to tokenize these lines into subwords.
  2. Run clean_testing_bpe to finalize the input files.

Please check the readme file under CURE/data/data; the Prepare Test Input section shows the steps. If possible, I recommend you integrate these three steps into your own script.
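As a rough illustration of that suggestion, here is a minimal wrapper sketch. It assumes prepare_cure_input and clean_testing_bpe can be imported from data/data/prepare_testing_data.py and that subword-nmt is on the PATH; the argument lists and file names are placeholders rather than the repository's actual signatures:

    # Hypothetical wrapper chaining the three test-preparation steps.
    import subprocess

    from prepare_testing_data import prepare_cure_input, clean_testing_bpe

    # Step 1: build the raw CURE input files (buggy line + context, identifiers).
    prepare_cure_input()   # real arguments: see data/data/prepare_testing_data.py

    # Step 2: apply the BPE codes (subword.txt) to the files produced by step 1.
    for src, dst in [('input.txt', 'input_bpe.txt'),
                     ('identifier.tokens', 'identifier_bpe.tokens')]:
        with open(src) as fin, open(dst, 'w') as fout:
            subprocess.run(['subword-nmt', 'apply-bpe', '-c', 'subword.txt'],
                           stdin=fin, stdout=fout, check=True)

    # Step 3: clean the BPE output into the final inputs for generator.py.
    clean_testing_bpe()    # real arguments: see data/data/prepare_testing_data.py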

msintaha commented 1 year ago

I have actually run those steps as well, using the generated subword.txt. They are mentioned at the end of the prepare_cure_input script.

msintaha commented 1 year ago

First I generated the vocab using:

subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt

Then I ran:

subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens
jiang719 commented 1 year ago

First I generated the vocab using:

subword-nmt learn-joint-bpe-and-vocab --input training_tokenize.txt -s 50000 -o subword.txt --write-vocabulary vocabulary.txt

Then I ran:

subword-nmt apply-bpe -c subword.txt < training_tokenize.txt > training_bpe.txt
subword-nmt apply-bpe -c subword.txt < input.txt > input_bpe.txt
subword-nmt apply-bpe -c subword.txt < validation_tokenize.txt > validation_bpe.txt
subword-nmt apply-bpe -c subword.txt < identifier.tokens > identifier_bpe.tokens

Then you should have a file called identifier_bpe.tokens, which should not contain <SEP>, as the input to generator.py.

But now I assume the problem is the vocabulary: since you trained your own subword-nmt model, the vocabulary file also changed. How many unique lines do you have in your own vocabulary.txt?

If you change the vocabulary file, you will need to re-train the GPT first (i.e., re-train a new Huggingface GPT model), since the one I shared can only recognize the 50057 tokens in data/vocabulary/vocabulary.txt. If your new vocabulary file contains tokens outside that range, it will cause the index out of range error.
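To make the failure mode concrete, here is a small self-contained example of how an out-of-range token id triggers exactly this error (the sizes are illustrative, with 50057 taken from the comment above):

    # torch.nn.Embedding raises "IndexError: index out of range in self" whenever a
    # token id is >= the number of rows in the embedding table.
    import torch
    import torch.nn as nn

    embed = nn.Embedding(num_embeddings=50057, embedding_dim=768)

    ok = torch.tensor([[0, 5, 50056]])    # every id < 50057: fine
    print(embed(ok).shape)                # torch.Size([1, 3, 768])

    bad = torch.tensor([[0, 5, 50057]])   # 50057 is out of range for a 50057-row table
    embed(bad)                            # IndexError: index out of range in self

So test inputs tokenized against a vocabulary that does not match the one the checkpoint was trained with can produce ids the embedding table never allocated.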

msintaha commented 1 year ago

We have 46,247 lines in vocabulary.txt. And yes, the generated identifier_bpe.tokens file does not contain <SEP>.

jiang719 commented 1 year ago

That looks reasonable. Could you wrap the call to generate_gpt_conut in a try-except block and see whether it crashes for every input or just some?

Another possibility I can imagine is that the input exceeds the maximum length (1024 tokens) set for the GPT model. But that would only cause the long inputs to crash.
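A minimal sketch of both suggested checks; the names samples and beamsearch below are assumptions standing in for the actual objects in src/tester/generator.py, and input_bpe.txt follows the naming used earlier in this thread:

    # Check 1: wrap the crashing call and count how many test inputs fail.
    def run_all(samples, beamsearch):
        failures = []
        for i, sample in enumerate(samples):
            try:
                beamsearch.generate_gpt_conut(sample)
            except IndexError as err:
                failures.append(i)
                print(f'sample {i} crashed: {err}')
        print(f'{len(failures)} of {len(samples)} samples crashed')

    # Check 2: confirm no BPE-tokenized input exceeds the 1024-token GPT limit.
    with open('input_bpe.txt') as f:
        lengths = [len(line.split()) for line in f]
    print('longest input (tokens):', max(lengths))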

msintaha commented 1 year ago

The maximum input length is within 1022 tokens. We have wrapped it in a try-except block, and it crashes on all the inputs.

ozzydong commented 1 year ago

Hi there @lin-tan, I just cloned your code and tried to run it using the model you have already trained, but it always fails with the following error:

D:\Python\python.exe E:/cure/CURE/src/tester/generator.py
50061
Traceback (most recent call last):
  File "E:/cure/CURE/src/tester/generator.py", line 135, in <module>
    generate_gpt_conut(vocab_file, model_file, input_file, identifier_txt_file, identifier_token_file, output_file, beam_size)
  File "E:/cure/CURE/src/tester/generator.py", line 63, in generate_gpt_conut
    model_file, map_location='cpu'
  File "D:\Python\lib\site-packages\torch\serialization.py", line 713, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "D:\Python\lib\site-packages\torch\serialization.py", line 930, in _legacy_load
    result = unpickler.load()
  File "D:\Python\lib\site-packages\torch\serialization.py", line 746, in find_class
    return super().find_class(mod_name, name)
ModuleNotFoundError: No module named 'transformers.configuration_openai'

I would really appreciate your insight into this error. :)
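This ModuleNotFoundError usually means the shared checkpoint was pickled under an older transformers release, where transformers.configuration_openai still existed as a top-level module, and is being loaded under transformers 4.x, where that module moved into transformers.models.openai. A hedged sketch of one possible workaround follows; it is an assumption about the cause, and installing the transformers version the repository was developed against is likely the simpler fix:

    # Speculative shim (not part of the CURE codebase): re-register the pre-4.x module
    # paths so torch.load's unpickler can resolve the classes stored in the checkpoint.
    import sys
    import torch
    import transformers.models.openai.configuration_openai as configuration_openai
    import transformers.models.openai.modeling_openai as modeling_openai

    sys.modules['transformers.configuration_openai'] = configuration_openai
    sys.modules['transformers.modeling_openai'] = modeling_openai

    checkpoint = torch.load('gpt_conut.pt', map_location='cpu')  # path is illustrative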

studypython33 commented 1 year ago

@ozzydong I also met the same problem. Have you solved it? Thank you.

BaiGeiQiShi commented 1 year ago

@studypython33 I also met the same problem. Have you solved it? Thanks in advance.
