Closed: Cheungki closed this issue 3 months ago
Hi, it is not your mistake—it is normal for the scoring process to take that long. Scoring the data is indeed time-consuming, but there is a significant benefit: once you've scored the datasets, you can reuse them anytime you want to run the cross-task evaluation.
For example, suppose it takes about 100 hours to score the datasets for all task types (let's say you have task types A, B, C, and D). When you evaluate on task type A, you use the retriever trained on the scored datasets of B, C, and D. When you evaluate on task type B, you can use the retriever trained on A, C, and D. Here, the scored datasets of C and D are reused, so you do not need to score them again.
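To make the reuse concrete, here is a minimal sketch (not the actual scripts from this repo) of how the expensive scoring step could be cached on disk and shared across the leave-one-out splits; `score_dataset`, `run_cross_task_eval`, and the file layout are hypothetical stand-ins under the four-task-type example above.

```python
from pathlib import Path

TASK_TYPES = ["A", "B", "C", "D"]   # the four task types from the example above
SCORED_DIR = Path("scored_data")    # assumed cache location for scored datasets

def score_dataset(task_type: str) -> Path:
    """Score one task type and cache the result, so it is only ever scored once."""
    out_path = SCORED_DIR / f"{task_type}.scored.json"
    if out_path.exists():           # reuse the expensive scoring output
        return out_path
    # ... run the (slow) scoring step here and write its output to out_path ...
    return out_path

def run_cross_task_eval(held_out: str) -> None:
    """Train a retriever on every task type except the held-out one, then evaluate on it."""
    train_files = [score_dataset(t) for t in TASK_TYPES if t != held_out]
    # ... train the retriever on train_files and evaluate it on the held-out task type ...

for task in TASK_TYPES:
    run_cross_task_eval(task)       # scoring runs at most once per task type
```

With a cache like this in place, evaluating on A, B, C, and D in turn scores each dataset once instead of three times.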
The 36 hours mentioned in Appendix B refers to the time required to train the retriever and does NOT include the scoring time. For instance, if you are evaluating on task type A, it would take about 36 hours to train a retriever on the scored datasets of B, C, and D.
It is indeed a lengthy process to perform cross-task evaluations. However, if your intention is to train task-specific retrievers, the time required is much more manageable. In this case, you only need to score one dataset and then train the retriever on that single dataset, which takes significantly less time.
Many thanks!
Thanks for your nice work!
Following your guidance, I'm trying to reproduce the results mentioned in your paper using the following command with `get_cmds.py`:

After that, 14 datasets will be scored by running `scorer.py` in the generated `train.sh`. It takes around 7 hours to finish just the first scoring script, which is very time-consuming.

Here are my questions: