microsoft / TOXIGEN

This repo contains the code for generating the ToxiGen dataset, published at ACL 2022.

IndexError from alice.py #19

Open 482c opened 1 year ago

482c commented 1 year ago

Hey! Awesome paper and thank you for the open resources.

I am trying to reproduce generate_text.ipynb from the notebooks in Google Colab. The link in the notebook to Google Colab displayed an error so I created a duplicate here.

Date Seen: 06/05/2023

Versions: Python 3.10

Steps to Reproduce: The bug occurred when calling alice() as shown in the notebook.

The same thing happens with the command:

!python generate.py --input_prompt_file /content/drive/MyDrive/coding_projects/toxigen/prompts/neutral_black_1k.txt --language_model GPT3 --classifier RoBERTa --ALICE True --output_file test_file.txt --num_generations_per_prompt 10 --generation_mode neutral --endpoint_url https://api.openai.com/v1/engines/text-ada-001/completions --api_key <API-KEY>

There was also a minor bug in generate.py, which can be resolved by rewriting the failing line as f.write(f"{response}\n"). The original traceback:

Traceback (most recent call last):
  File "/mnt/c/Users/tranh/Desktop/unistuff/3iib/finalproject/chatbot-utterances/TOXIGEN-main/generate.py", line 52, in <module>
    main()
  File "/mnt/c/Users/tranh/Desktop/unistuff/3iib/finalproject/chatbot-utterances/TOXIGEN-main/generate.py", line 48, in main
    f.write(response + "\n")
TypeError: unsupported operand type(s) for +: 'dict' and 'str'
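
The TypeError above means `response` was not a plain string when `f.write` was called; a later comment in this thread reports seeing a list instead of a dict. A minimal sketch of a writer that tolerates all three shapes is shown below. The helper names `response_to_lines` and `append_response` are hypothetical, and the assumption that a dict response follows the completions layout (`choices[i]["text"]`) is not confirmed by generate.py.

```python
def response_to_lines(response):
    """Flatten a generation result into a list of strings (hypothetical helper)."""
    if isinstance(response, str):
        return [response]
    if isinstance(response, dict):
        # Assumption: a completions-style payload with a "choices" list.
        return [choice.get("text", "") for choice in response.get("choices", [])]
    # Otherwise assume an iterable of sentences.
    return [str(item) for item in response]


def append_response(path, response):
    """Append each generated string to the output file, one per line."""
    with open(path, "a", encoding="utf-8") as f:
        for line in response_to_lines(response):
            f.write(line.rstrip("\n") + "\n")
```

Printing `type(response)` once before writing is still the quickest way to check which branch actually applies in a given setup.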

However, the main problem is IndexError and I am not sure how to fix it.

Thartvigsen commented 1 year ago

Hi thanks for your interest in our work! Good catch in generate.py. Thanks!

The IndexError in alice.py is likely coming from changes in the GPT-3 API, though I can't confirm. It could help to investigate where that range error is coming from in the file itself and see if one of those input lists is perhaps empty? Assuming this stems from an API change, I don't have access at this point to dig into it.

zqypku commented 1 year ago

I also ran into these two problems.

  1. IndexError: This is caused by predicting things like '\n', ' ', or '<|endoftext|>', which can be solved by pre-defining some stopwords.
  2. f.write(f"{response}\n"): response here is a list of sentences. (But in your error report it seems to be a dict.) So you can just print it out and see how to extract the string to write. E.g., I changed it to:
    with open(args.output_file, "a") as f:
        for r in response:
            f.write("- "+r)
        f.write('\n')

Oreki-PJ commented 1 year ago

I also hit the IndexError, but I don't know where or how to pre-define these stopwords. Can you help me? Thanks!
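
One possible reading of the stopword suggestion is to drop whitespace-only and end-of-text tokens from the candidate set before choosing the next token. The sketch below illustrates that idea on a plain token-to-logprob dictionary. The names `STOP_TOKENS`, `usable_candidates`, and `pick_token` are hypothetical, and where such a filter would belong in alice.py is not stated in this thread.

```python
# Hypothetical stop list built from the tokens named earlier in the thread.
STOP_TOKENS = {"\n", " ", "", "<|endoftext|>"}


def usable_candidates(top_logprobs, stop_tokens=STOP_TOKENS):
    """Drop stop tokens and whitespace-only tokens from a token -> logprob map."""
    return {
        token: score
        for token, score in top_logprobs.items()
        if token not in stop_tokens and token.strip()
    }


def pick_token(top_logprobs):
    """Return the highest-scoring usable token, or None if none remain."""
    candidates = usable_candidates(top_logprobs)
    if not candidates:
        return None  # the caller can treat this as "generation finished"
    return max(candidates, key=candidates.get)
```

Returning `None` rather than indexing into an empty collection lets the caller decide how to end the sequence.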

mmmency commented 11 months ago

The same thing happens with toxigen-hatebert. I'm trying to detect the toxicity of some sentences using the pretrained toxigen-hatebert model, and the error occurs. I checked my input size and vocab size but found nothing.

Thartvigsen commented 11 months ago

Hi @mmmency is this error only happening with toxigen-hatebert? This thread is about the ALICE method.

If you're just running into index errors with toxigen-hatebert, this thread discusses how you need to use the bert-base-uncased tokenizer with toxigen-hatebert.

arinakosovskaia commented 5 months ago

Hi @Thartvigsen! I also encountered this error. The problem is that, since we generate only one token, the API may return a response like the following:

{text: '', index: 0, logprobs: null, finish_reason: 'stop'}

In such cases, outputs['choices'][i]['logprobs']['top_logprobs'] is an empty array rather than a dictionary of possible tokens with their corresponding scores. Since generation for this sentence has already finished, the simplest solution that works well with the existing code, in my opinion, is to add a placeholder to scores[i] in such cases instead of outputs['choices'][i]['logprobs']['top_logprobs']: {'\n': 0.0, ' ': -100, '.': -100, '<|endoftext|>': -100, '\n': -100}

I made a pull request with the corresponding fix, I hope it will be helpful to everyone who would like to reproduce the code 😊
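
For readers who want to see the shape of that guard without opening the pull request, here is a minimal sketch. It is not the code from the PR. The names `PLACEHOLDER_LOGPROBS`, `first_token_logprobs`, and `collect_scores` are hypothetical, the sketch assumes `top_logprobs` is a list with one dictionary per generated token, and it assumes the duplicated '\n' key in the placeholder above is a typo, keeping the 0.0 entry.

```python
# Placeholder scores used when a choice carries no logprobs.
# Assumption: the duplicated '\n' key is a typo, so only '\n': 0.0 is kept.
PLACEHOLDER_LOGPROBS = {"\n": 0.0, " ": -100.0, ".": -100.0, "<|endoftext|>": -100.0}


def first_token_logprobs(choice):
    """Return the top-logprobs dict for the first generated token of one choice,
    or a copy of the placeholder when the choice has none."""
    logprobs = choice.get("logprobs") or {}   # handles logprobs: null
    top = logprobs.get("top_logprobs") or []  # handles a missing or empty list
    if top and top[0]:
        return top[0]
    return dict(PLACEHOLDER_LOGPROBS)


def collect_scores(outputs):
    """Build one score dict per choice in a completions-style response."""
    return [first_token_logprobs(choice) for choice in outputs.get("choices", [])]
```

The guard covers both failure shapes mentioned in this thread: `logprobs` being null and `top_logprobs` being empty.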

Thartvigsen commented 4 months ago

Thank you @arinakosovskaia!