To assess the extent of Latin knowledge ingrained in OpenAI's Davinci GPT-3 model, as well as its ability to internalize correct responses during fine-tuning, we employed three distinct models, each fine-tuned with quiz data of the PENSVM-A type.
We selected Chapter 5 as the evaluation dataset because the baseline Davinci model performed worst on this chapter; this choice is therefore expected to be the most informative about the models' performance after fine-tuning.
Model 1:
Accuracy Rate: 0.0
This model is based on OpenAI's original Davinci GPT-3 model and was fine-tuned on the first five questions of the PENSVM-A quiz for Chapter 5, rendered in all nine prompt styles. It was then evaluated on the remaining questions of the same chapter, using the same nine prompt styles.
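For concreteness, the following is a minimal sketch of how such a fine-tuning set can be assembled for OpenAI's legacy fine-tuning interface. The questions, answers, and prompt-style templates shown are hypothetical stand-ins for the actual PENSVM-A material, which is not reproduced here.

```python
import json

# Hypothetical stand-ins for the first five PENSVM-A Chapter 5 questions and
# two of the nine prompt-style templates; the real quiz data is not shown.
questions = [
    {"question": "Quid agricola laudat?", "answer": "sapientiam"},
    {"question": "Quid poeta amat?", "answer": "otium"},
    # ... three more questions ...
]
prompt_styles = [
    "Answer the following Latin question in Latin.\nQ: {q}\nA:",
    "Latin quiz. Question: {q} Answer:",
    # ... seven more templates ...
]

# OpenAI's legacy fine-tuning format is JSONL with prompt/completion pairs;
# the legacy guide recommends a leading space on each completion.
with open("pensvm_a_ch5_train.jsonl", "w") as f:
    for item in questions:
        for style in prompt_styles:
            record = {
                "prompt": style.format(q=item["question"]),
                "completion": " " + item["answer"],
            }
            f.write(json.dumps(record) + "\n")
```

Under the legacy tooling, the resulting file could then be submitted with `openai api fine_tunes.create -t pensvm_a_ch5_train.jsonl -m davinci`; whether the authors used the CLI or the Python bindings is not stated.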
Model 2:
Accuracy Rate: 0.0
This model likewise builds on the Davinci GPT-3 base, but was fine-tuned on the first five questions from each of Chapters 1 through 5 of the PENSVM-A quizzes, again in all nine prompt styles. It was evaluated in the same way as Model 1, on the remaining questions of the PENSVM-A quiz for Chapter 5.
The drop in accuracy of both fine-tuned models relative to the baseline is not entirely unexpected, given that the fine-tuning data was limited to the first five questions of Chapter 5 (Model 1) or of Chapters 1 through 5 (Model 2). Since the baseline model evidently lacks genuine Latin knowledge, it cannot generalize from so few examples. Moreover, because certain responses recurred as correct answers within the training data, both fine-tuned models tended simply to reproduce those memorized answers at test time, regardless of the question asked, hence the decline in accuracy.
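One simple way to check this repeated-answer effect is to tally completion frequencies in the training file. A sketch, assuming the JSONL layout from the earlier snippet:

```python
import json
from collections import Counter

# Count how often each completion recurs across the fine-tuning examples.
# Heavily repeated answers are the likely sources of the degenerate,
# training-matched responses described above.
with open("pensvm_a_ch5_train.jsonl") as f:
    answer_counts = Counter(json.loads(line)["completion"].strip() for line in f)

for answer, count in answer_counts.most_common(5):
    print(f"{count:3d}x  {answer}")
```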
Model 3:
Accuracy Rate for the Excluded Prompt Style (Prompt 6): 0.47
Accuracy Rate for an Included Prompt Style (Prompt 5): 0.67
Model 3 is likewise built on the Davinci GPT-3 model, but unlike Models 1 and 2 it was fine-tuned on all questions from the PENSVM-A quiz for Chapter 5, rendered in eight of the nine (N-1) prompt styles; prompt style 6 was deliberately held out because the baseline model performed best on that style. Two distinct evaluations were conducted to assess the model.
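In terms of data preparation, the held-out split amounts to dropping one template from the style list before the training file is written. A minimal sketch, using placeholder templates in place of the actual nine styles:

```python
# Prompt style 6 corresponds to index 5 in a zero-based list of templates.
EXCLUDED_STYLE_INDEX = 5

# Placeholder templates standing in for the nine actual prompt styles.
nine_styles = [f"style-{i + 1} template: {{q}}" for i in range(9)]

train_styles = [s for i, s in enumerate(nine_styles) if i != EXCLUDED_STYLE_INDEX]
held_out_style = nine_styles[EXCLUDED_STYLE_INDEX]

assert len(train_styles) == 8  # the eight (N-1) styles used for fine-tuning
```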
In the first evaluation, the model was tested on all questions from PENSVM-A Chapter 5 using the excluded prompt style 6. Contrary to the initial expectation of an accuracy rate approaching 1, under the assumption that the fine-tuned model would have memorized the correct answers to these very questions, the accuracy rate was a mere 0.47. Two potential explanations were considered. First, the differences between prompt styles might have confused the model, preventing it from producing the correct answers. Alternatively, the Davinci baseline model may simply lack the capacity to retain the correct answers from the fine-tuning data, and therefore falters even when faced with identical questions at test time.
To determine which of these conjectures holds, a second evaluation was conducted on all questions from PENSVM-A Chapter 5 using a randomly selected prompt style that was included in the fine-tuning data, which happened to be prompt style 5. Here the model achieved an accuracy rate of 0.67. Since even an included prompt style does not yield perfect accuracy, the Davinci model evidently cannot retain all the correct answers presented during fine-tuning. Nevertheless, the gap between the two evaluations (0.67 versus 0.47) lends credence to the notion that differences in prompt style also affect the model's performance.
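For reference, here is a sketch of how these per-style accuracy rates could be computed against a fine-tuned model through the legacy `openai` Python bindings (pre-1.0). The model identifier, the grading by normalized exact match, and the decoding settings are all assumptions, not details taken from the experiments above.

```python
import openai  # legacy openai-python (<1.0) bindings


def style_accuracy(model_id: str, items: list[tuple[str, str]], style: str) -> float:
    """Exact-match accuracy of `model_id` on (question, answer) pairs
    rendered with a single prompt-style template."""
    correct = 0
    for question, gold in items:
        response = openai.Completion.create(
            model=model_id,              # e.g. a "davinci:ft-..." fine-tune
            prompt=style.format(q=question),
            max_tokens=16,
            temperature=0,               # deterministic decoding for grading
        )
        prediction = response["choices"][0]["text"].strip().lower()
        correct += int(prediction == gold.strip().lower())
    return correct / len(items)
```

Calling such a helper once with the held-out style and once with an included style reproduces the shape of the two evaluations reported above.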