hsiehjackson / RULER

This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
Apache License 2.0
555 stars 36 forks source link

GPT-4-1106-preview #63

Closed yxgcsq closed 5 days ago

yxgcsq commented 1 week ago

GPT-4-1106-preview What are the scores for each of the 13 tasks in the 4k test

yxgcsq commented 1 week ago

image This is the score I measured

hsiehjackson commented 1 week ago
Tasks niah_single_1 niah_single_2 niah_single_3 niah_multikey_1 niah_multikey_2 niah_multikey_3 niah_multivalue niah_multiquery vt fwe cwe qa_1 qa_2
Score 100.0 100.0 100.0 100.0 100.0 100.0 99.5 100.0 100.0 98.0 100.0 88.0 70.0

These are my results in the 4k test. Can you check the responses in variable tracking and common word extraction? I think maybe the API returns something unexpected.

yxgcsq commented 5 days ago

Thank you thank you