princeton-nlp / CEPE

[ACL 2024] Long-Context Language Modeling with Parallel Encodings
https://arxiv.org/abs/2402.16617
MIT License

My open-domain QA results are much lower when using the provided CEPE-LLaMA-2-7B checkpoint. Could you provide some insight into the potential causes of this decline? #1

Closed: sunnynexus closed this issue 3 months ago

sunnynexus commented 7 months ago

I'm curious about the discrepancies between my results (in red), obtained with the default parameters of the run_qa.sh script, and the results presented in your paper (in black).

[Screenshot: open-domain QA results table, with the reproduced numbers in red and the paper's numbers in black]

Are there any errors on my end that could explain these differences?

howard-yen commented 7 months ago

Hi, thanks for your interest in our work. For CEPE at k = 10, we only use the decoder and put all the passages in the decoder's context, so the results should match those of LLaMA-2. There might have been a mistake in the config file, which I will look into. Are you also using the QA files from the Google Drive?
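
For reference, here is a minimal sketch of that routing (a hypothetical helper, not the repo's actual API; the parameter name is assumed):

```python
# Hypothetical sketch of how retrieved passages could be routed between
# the parallel encoder and the decoder context (not the repo's actual API).
def route_passages(passages, num_decoder_passages):
    """Place the first passages directly in the decoder context and
    send the remainder to the encoder to be encoded in parallel."""
    decoder_passages = passages[:num_decoder_passages]
    encoder_passages = passages[num_decoder_passages:]
    return decoder_passages, encoder_passages

# At k = 10 with all passages in the decoder, the encoder gets no input,
# so generation should reduce to plain LLaMA-2 behavior.
dec, enc = route_passages([f"passage_{i}" for i in range(10)], num_decoder_passages=10)
assert enc == []  # empty encoder input -> decoder-only, i.e. the LLaMA-2 baseline
```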

sunnynexus commented 7 months ago

> Hi, thanks for your interest in our work. For CEPE at k = 10, we only use the decoder and put all the passages in the decoder's context, so the results should match those of LLaMA-2. There might have been a mistake in the config file, which I will look into. Are you also using the QA files from the Google Drive?

Thank you for your reply. Yes, I used the QA files from the Google Drive.

sunnynexus commented 7 months ago

I have tried running it multiple times, but the results are still no better than those of the plain LLaMA-2-7B model.
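
One way to narrow this down might be to reproduce the plain LLaMA-2-7B baseline row independently of the repo's scripts. A minimal sketch with Hugging Face transformers, assuming greedy decoding and an illustrative prompt template (not the paper's exact format):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sanity check: answer a question with all k = 10 retrieved passages placed
# directly in the decoder context, the setting CEPE at k = 10 should match.
model_name = "meta-llama/Llama-2-7b-hf"  # base model; requires gated access
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

passages = ["..."] * 10  # the 10 retrieved passages from the shared QA files
question = "..."         # the open-domain QA question
# Illustrative template only; the paper's exact prompt may differ.
prompt = "\n\n".join(passages) + f"\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
answer = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```

If this standalone baseline matches the paper's LLaMA-2 row, the gap is more likely caused by the CEPE config than by the QA data.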