seonghyeonye / TAPP

[AAAI 2024] Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following
https://arxiv.org/abs/2302.14691
MIT License

how to score and the version for GPT-J 6B and GPT-NeoX 20B #3

Closed 2018211801 closed 1 year ago

2018211801 commented 1 year ago

I'm an NLP beginner and I'm not familiar with ROUGE. I noticed that there are several scoring-related files (collect_metric.py, run.sh, and the README), but they are not consistent with each other, so how do you score the predictions? I would also be very grateful if you could provide the versions for GPT-J 6B and GPT-NeoX 20B!

2018211801 commented 1 year ago

After a day of reading the code, it seems that the scores you used in your paper are the all_rougeL values from compute_metrics.py, is that right? Also, in my opinion, if I want to save costs, I should reduce the number of instances selected in each task, because the scores of individual tasks vary so much.

hbin0701 commented 1 year ago

To answer your first question:

"I would be very appreciated if you could provide the version for GPT-J 6B and GPT-NeoX 20B!"

The scripts for running open-sourced models for inference can be found in ICIL/scripts/decoder. To run GPT-J 6B or GPT-NeoX 20B, change the model_name_or_path parameter to the model you want to use.
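For reference, here is a minimal sketch of what that swap amounts to at the Hugging Face level (this is not the repository's actual inference script; the model IDs below are the standard public checkpoints, and the ICIL-style prompt is only a placeholder):

```python
# Sketch only: load whichever decoder model_name_or_path points to and generate.
# Note that GPT-J 6B / GPT-NeoX 20B need substantial GPU memory.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name_or_path = "EleutherAI/gpt-j-6B"  # or "EleutherAI/gpt-neox-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

prompt = "Definition: ...\nInput: ...\nOutput:"  # placeholder ICIL-style prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens, not the prompt.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```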

For the second part,

The scores you used in your paper seem to be the all_rougeL values from compute_metrics.py, is that right?

The short answer is "No." In compute_metrics.py, specifially at line 161, you can see the metric that each subtask adopts either RougeL or Exact Match, depending on its task category. For Figure 1, as we report Average performance of 119 evaluation tasks on SUPERNI benchmark., we take average of all the tasks performance, and this would be a mixture of exact match score and RougeL scores.

Finally, you have suggested,

If I want to save costs, I should reduce the number of instances selected in each task, because the scores of individual tasks vary so much.

If I understand you correctly, you believe that since the scores of individual tasks vary so widely, it would be better to decrease the number of instances per task rather than reduce the number of tasks being tested. While this is certainly an option, keep in mind that if you want to compare performance across different models, you need a sufficient number of instances per task to evaluate each model reliably.
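If you do go the subsampling route, a simple sketch of keeping a fixed number of instances per task might look like the following (the data layout is hypothetical; fixing the seed is important so every model is evaluated on the same subset):

```python
# Sketch of cost-saving subsampling. `tasks` is assumed to be a dict
# mapping task name -> list of evaluation instances.
import random

def subsample_instances(tasks, n_per_task=20, seed=42):
    rng = random.Random(seed)  # fixed seed -> same subset for every model
    subsampled = {}
    for task_name, instances in tasks.items():
        if len(instances) <= n_per_task:
            subsampled[task_name] = instances
        else:
            subsampled[task_name] = rng.sample(instances, n_per_task)
    return subsampled
```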

2018211801 commented 1 year ago

Thanks very much!