siyuyuan / coscript

Resources for our ACL 2023 paper: Distilling Script Knowledge from Large Language Models for Constrained Language Planning

Missing Categories In Source Data (WikiHow) #1

Closed · yuto3o closed this issue 2 months ago

yuto3o commented 4 months ago

Hello! In the CoScript dataset, each data point has a corresponding category. As mentioned in your paper, the data is sourced from WikiHow (Koupaee and Wang, 2018), but the category information is not included in the provided dataset. Do I need to re-align this data with the category information available on wikihow.com?

siyuyuan commented 4 months ago

Thank you for your question. The category is included in the data: you can find it in the value of the "Category" key in each data point.
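
For example, something like this should print the category of each example (a rough sketch; the file name is a placeholder, and it assumes the split is stored as JSON lines, one example per line):

import json

# Placeholder path to a CoScript split; adjust to your local copy.
address = "coscript_train.json"

# Assuming one JSON object per line (JSON lines);
# if the file is a single JSON array, use json.load(f) instead.
with open(address, "r", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Each example carries its category under the "Category" key.
        print(example["Category"])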

yuto3o commented 4 months ago

But I can't find "Category" in the WikiHow data, in either the wikihowAll.csv or the wikihowSep.csv file.

siyuyuan commented 4 months ago

Oh, I apologize. I thought you were talking about CoScript. If you're referring to the original WikiHow data, you would indeed need to scrape the official website and align the categories yourself.
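
If it helps, a rough sketch of that scraping step could look like the following (illustrative only; the breadcrumb selector and the example URL are assumptions about the current wikihow.com layout and need to be verified):

import requests
from bs4 import BeautifulSoup

def fetch_wikihow_category(article_url: str) -> list[str]:
    """Return the breadcrumb category path of a WikiHow article.

    The '#breadcrumb' selector is an assumption about the page layout and
    should be checked against the live site before large-scale scraping.
    """
    html = requests.get(article_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    breadcrumb = soup.select_one("#breadcrumb")
    if breadcrumb is None:
        return []
    return [a.get_text(strip=True) for a in breadcrumb.find_all("a")]

# Hypothetical usage with a made-up URL:
# print(fetch_wikihow_category("https://www.wikihow.com/Some-Article"))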

yuto3o commented 4 months ago

ok, thank you :)

yuto3o commented 2 months ago

I have another problem. I trained GPT-2 on CoScript and used the BLEU score to evaluate the model. However, with BLEU-4 I can only achieve poor results (~7 points) across scales (GPT-2 base and large), while the BLEU values reported in the paper are over 15. Could you please explain the BLEU setting?

siyuyuan commented 2 months ago

Thank you for your question. We found that the model's performance is quite sensitive to the hyperparameters, so we ran many tuning experiments to find the best-performing configuration. Could you please tell me your parameter settings? I am attending ACL at the moment, so I will share our parameter settings later.

yuto3o commented 2 months ago

Thank you for your prompt response.

Yes, the model is very sensitive to parameter settings. After several experiments, I have found appropriate parameters and have manually reviewed the model's outputs.

My most pressing need is to align the evaluation methods:

Given that the paper lacks specific details on the evaluation settings, I chose BLEU-4 (maximum n-gram order of 4) as my evaluation metric, using the bleu metric from huggingface/evaluate (see the sketch at the end of this comment).

In my experiments, the BLEU score remains notably low, indicating a significant gap. In contrast, both ROUGE and BERT-Score have achieved values comparable to those reported in the paper.
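
For concreteness, my evaluation looks roughly like this (a minimal sketch with placeholder predictions and references):

import evaluate

# Load the BLEU metric from huggingface/evaluate.
bleu = evaluate.load("bleu")

# Placeholder data: one generated script and its reference.
predictions = ["wash the apple and cut it into slices"]
references = [["wash the apple and then cut it into slices"]]

# max_order=4 gives BLEU-4 (up to 4-gram precision).
results = bleu.compute(predictions=predictions, references=references, max_order=4)
print(results["bleu"])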

siyuyuan commented 2 months ago

Sorry for the lack of specific details about the evaluation settings. For our automatic evaluation metrics, we adopt the following:

import json
from nltk.translate.bleu_score import sentence_bleu

# 'address' points to a JSON-lines file of generation results,
# one {"Generated Events": ..., "Golden Events": ...} object per line.
json_data = []
with open(address, 'r', encoding="utf-8") as f:
    for jsonstr in f.readlines():
        json_data.append(json.loads(jsonstr))

p = []  # generated scripts
r = []  # gold scripts
for data in json_data:
    p.append(data["Generated Events"])
    r.append(data["Golden Events"])

# weights=(1, 0, 0, 0) scores unigram precision only, i.e. BLEU-1.
l = len(r)
score = []
for i in range(l):
    score.append(sentence_bleu([r[i].split()], p[i].split(), weights=(1, 0, 0, 0)))
print(round(sum(score) / l, 4))

However, considering that the evaluation method in our paper dates from two years ago (we completed the first draft of this work in September 2022), and that our follow-up experiments have shown GPT-4 already achieves good performance on CoScript, we strongly recommend that you consider newer evaluation methods, such as using GPT-4 to evaluate the accuracy of the results generated by the small model and to judge the error types we defined in the paper.
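
As a rough illustration of what I mean, not the exact setup from our paper (the prompt, model name, and rubric below are placeholders):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_script(goal: str, generated_script: str) -> str:
    """Ask GPT-4 whether a generated script correctly accomplishes the goal.

    The prompt and the accuracy/error-type rubric are illustrative placeholders.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Generated script:\n{generated_script}\n\n"
        "Does this script correctly accomplish the goal under its constraint? "
        "Answer 'correct' or name the error type (e.g., missing constraint, wrong order)."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content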

yuto3o commented 2 months ago

Thank you again for your detailed answers :-)