paperswithcode / galai

Model API for GALACTICA

Questions about reproducing the downstream scientific tasks result (especially BioASQ) #81

Closed SeonggwanAhn closed 11 months ago

SeonggwanAhn commented 1 year ago

Hello. I am Seonggwan, a researcher interested in domain-specialized LLMs.

I've been very impressed with your scientific language model, Galactica, and wanted to ask you some questions about it. Inspired by your model, I tried to reproduce the downstream scientific tasks of Galactica-120b. However, the results I got on the BioASQ task were 90.00% (5-shot) and 88.57% (0-shot), which are drops of about 4~6% compared to the result (94.3%) reported in the paper.

So I would like to reproduce it more accurately, and I would be grateful if you could share your evaluation code or tell me more details about how you evaluated it. For example, I would like to know:

(1) How was the label inferred, i.e., by using the Hugging Face generate function or by comparing the logits of the corresponding label tokens?
(2) How did you construct the few-shot prompt for the other baseline LLMs?
(3) What is the format of the options for the BioASQ task, e.g., is the answer yes/no or (A)/(B)?
(4) Among the various sub-tasks of BioASQ, did you evaluate on task 7b?
(5) If it is correct that you evaluated with task 7b, did you use the dataset provided by BLURB?

To reproduce BioASQ, I added the BioASQ task 7b to the lm-eval-harness codebase. The code here uses loglikelihood for predicting labels instead of generating them. I also constructed random prompts from the BioASQ train/dev sets for the few-shot evaluation. My prompt format is constructed as follows.

Context: <context>\nQuestion: <question>\nAnswer:

Below is an example from my few-shot evaluation:

Context: significant role of frailty as a predictor of depression in a (omitted)\nQuestion: Has depression been shown to be a predictor of frailty?\nAnswer: yes\n###\nContext: its perturbation will lead to human disease, (omitted)\nQuestion: Can TAD disruption lead to disease?\nAnswer:
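For context, here is a minimal sketch of the loglikelihood-based labeling described above, assuming a Hugging Face causal LM; the small checkpoint name and the `label_loglikelihood` helper are illustrative, not the actual lm-eval-harness code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small checkpoint; the evaluation in question used galactica-120b.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")
model.eval()

def label_loglikelihood(prompt: str, label: str) -> float:
    """Sum of log-probabilities of the label tokens conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(label, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so shift by one to score the label tokens.
    n = label_ids.shape[1]
    log_probs = torch.log_softmax(logits[0, -n - 1:-1, :], dim=-1)
    return log_probs.gather(1, label_ids[0].unsqueeze(1)).sum().item()

prompt = "Context: <context>\nQuestion: <question>\nAnswer:"
prediction = max([" yes", " no"], key=lambda lbl: label_loglikelihood(prompt, lbl))
```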

If there is anything different from what you actually evaluated, I would be grateful if you could let me know this as well.

Thanks,
Ahn

AnthonyHartshorn commented 1 year ago

Hey,

1) The label was inferred by generating text and parsing a "Yes" or a "No".
2) I've just run the prompting code. Few-shot prompt sections are constructed as follows:

its perturbation will lead to human disease, (omitted) \n\nQuestion: Can TAD disruption lead to disease? Yes or no?\n\nAnswer:

The differences are:

- there is no "Context:" prefix before the context text
- "Yes or no?" is appended after the question
- the sections are separated with "\n\n" rather than "\n"

Let me know if this gets you to reproducibility or not!
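For reference, a hedged sketch of the generate-and-parse labeling described in (1), again assuming a Hugging Face causal LM; the parsing rule and checkpoint name are assumptions, not the exact code used for the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small checkpoint; the paper's numbers are for galactica-120b.
tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")
model.eval()

def predict_yes_no(prompt: str, max_new_tokens: int = 5) -> str:
    """Greedily generate a short continuation and parse a Yes/No label from it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens after the prompt.
    continuation = tokenizer.decode(
        output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    text = continuation.strip().lower()
    if text.startswith("yes"):
        return "Yes"
    if text.startswith("no"):
        return "No"
    return continuation  # fall back to the raw continuation if neither label is found

prompt = (
    "its perturbation will lead to human disease, (omitted) "
    "\n\nQuestion: Can TAD disruption lead to disease? Yes or no?\n\nAnswer:"
)
print(predict_yes_no(prompt))
```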

AnthonyHartshorn commented 1 year ago

@SeonggwanAhn note that I just edited this ^^ to fix an error

SeonggwanAhn commented 1 year ago

Thank you for your detailed and quick answer. It helped me a lot. What I meant by question (4) is that BioASQ has many subtasks (task 1a, 1b, 2a, 2b, ..., 11a, 11b), and I was asking whether you chose task 7b among them. But as you made clear in your answer, my question is resolved. Referring to your answer, I'll improve my evaluation strategy. Thanks a lot!!

SeonggwanAhn commented 1 year ago

Sorry, I've reopened this issue. I ran the evaluation with the prompt designed exactly the way you suggested above. The result is 92.86% in 0-shot, which is 4.5% better than my previous evaluation, but about 1.5% less than the 94.3% (BioASQ) reported in the paper.

So I'm trying to find some reasons for the performance difference, and the first is how prompts that exceed the max sequence length are handled. Some BioASQ 0-shot examples have more than 2048 tokens (I found that only one example in BioASQ exceeds 2048 tokens).

How did you handle the case of exceeding the max sequence length (2048)? Currently, I am truncating the tokens from the left side of the prompt (start of sentence) to fit the max sequence length.

Other than that, do you have any ideas about the causes of the difference in results? Thanks for the detailed answer earlier.

AnthonyHartshorn commented 1 year ago

Just letting you know that I've seen this and will try to help you out. Just racking my brains back a year and checking code ...

AnthonyHartshorn commented 1 year ago

Have you removed the case from the "Yes"/"No" in the answer?

AnthonyHartshorn commented 1 year ago

I'll have to get back to you on the context length question. What example exceeded the max sequence length?

AnthonyHartshorn commented 1 year ago

If I'm correct, your implementation is getting 130/140 correct and ours is getting 132/140. If that is so, we're probably looking at a difference of 2 examples.
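(For reference, 130/140 ≈ 92.9% and 132/140 ≈ 94.3%, which lines up with the 92.86% and 94.3% figures discussed above.)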

SeonggwanAhn commented 1 year ago

First of all, thanks for your willingness to help.

(1) Whether I removed 'Yes'/'No' in the answer: No, I didn't remove it. I checked all the predicted answers; all of them were either 'Yes' or 'No'.

(2) The case that exceeded the max sequence length: the 90th example in test.json, whose question starts with 'Is Miller-Dieker syndrome associated with abnormalities'. This example has 2209 tokens when formatted (i.e., '<sentence2>\n\nQuestion:<sentence1> Yes or No?\n\nAnswer:').

(3) Correct answers: yes, we are looking at a difference of 2 examples. Do you think we should compare our lists of correct/wrong examples?

AnthonyHartshorn commented 1 year ago

Ok, I'm just looking at the code now. We left-truncate to 2020 tokens (leaving the remainder for the answer). If that doesn't get you reproducibility, I'll share the tokenised text with you :-)
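For completeness, a minimal sketch of left-truncating a tokenised prompt to 2020 tokens as described above, leaving the remaining 28 tokens of the 2048-token window for the answer; the `left_truncate` helper name is illustrative:

```python
def left_truncate(prompt: str, tokenizer, max_prompt_tokens: int = 2020) -> str:
    """Drop tokens from the start (left side) of the prompt so that at most
    `max_prompt_tokens` tokens remain, keeping the question and "Answer:" cue intact."""
    ids = tokenizer(prompt, add_special_tokens=False).input_ids
    if len(ids) > max_prompt_tokens:
        ids = ids[-max_prompt_tokens:]  # keep only the rightmost tokens
    return tokenizer.decode(ids)

# Usage with a Hugging Face tokenizer (e.g. the one loaded in the sketches above):
# truncated_prompt = left_truncate(long_prompt, tokenizer)
```

Truncating the token ids directly and feeding them to the model avoids the small token-boundary shifts that decoding and re-encoding the truncated text can introduce.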

SeonggwanAhn commented 12 months ago

Ok, thanks. First let me try it using 2020 tokens (sliced from the left). If I don't get reproducibility, I'll just leave it, because (1) my result is close to the paper's result and (2) there's no other factor left to inspect. After the additional evaluation (2020 tokens), I'll tell you the results.