microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[uprise] reproducing the experiments in the original paper is time-consuming #247

Closed: Cheungki closed this issue 3 months ago

Cheungki commented 4 months ago

Thanks for your nice work!

Following your guidance, I'm trying to reproduce the results mentioned in your paper using the following command with get_cmds.py:

TRAIN_CLUSTERS='reading+close_qa+paraphrase+nli' # use `+` to concatenate your train clusters as a string
TEST_CLUSTERS='sentiment'  # use `+` to concatenate your test clusters as a string
SCORE_LLM=model_name_or_path  # LLM to score the data
INF_LLM=model_name_or_path # LLM for inference
OUTPUT_DIR=my_data
python get_cmds.py \
    --output_dir ${OUTPUT_DIR} \
    --train_clusters ${TRAIN_CLUSTERS} \
    --test_clusters ${TEST_CLUSTERS} \
    --scr_model ${SCORE_LLM} \
    --inf_model ${INF_LLM} \
    --multi_task \
    --gpus 4

After that, 14 datasets need to be scored by running scorer.py via the generated train.sh. The first scoring script alone takes around 7 hours to finish; at that rate, scoring all 14 datasets would take on the order of 100 hours, which is extremely time-consuming.
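As a quick sanity check (assuming the generated train.sh is written under ${OUTPUT_DIR}), the number of scoring commands can be counted with:

grep -c scorer.py ${OUTPUT_DIR}/train.sh  # should print 14 for this configuration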

Here are my questions:

1. Is such a long scoring time expected, or is something wrong with my setup?
2. Does the ~36 hours reported in Appendix B of the paper include this scoring time?

cdxeve commented 3 months ago

Hi, it is not a mistake on your end; it is normal for the scoring process to take that long. Scoring the data is indeed time-consuming, but there is a significant benefit: once you have scored the datasets, you can reuse them whenever you run the cross-task evaluation.

For example, suppose it takes about 100 hours to score the datasets for all task types (let's say you have task types A, B, C, and D). When you evaluate on task type A, you use the retriever trained on the scored datasets of B, C, and D. When you evaluate on task type B, you can use the retriever trained on A, C, and D. Here, the scored datasets of C and D are reused, so you do not need to score them again.
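A minimal sketch of this leave-one-cluster-out pattern, using the cluster names from the command above (the loop itself is illustrative, not part of the repo):

for TEST in reading close_qa paraphrase nli sentiment; do
    TRAIN=""
    for C in reading close_qa paraphrase nli sentiment; do
        [ "$C" = "$TEST" ] && continue  # hold out the cluster being evaluated
        TRAIN="${TRAIN:+${TRAIN}+}${C}"  # join the remaining clusters with `+`
    done
    echo "evaluate on ${TEST}; train retriever on scored clusters: ${TRAIN}"
done

Each cluster is scored exactly once, and every evaluation reuses the scores of the other clusters.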

Regarding the 36 hours mentioned in Appendix B, this refers to the time required to train the retriever and does NOT include the scoring time. For instance, if you are evaluating on task type A, it would take about 36 hours to train a retriever on the scored datasets of B, C, and D.

It is indeed a lengthy process to perform cross-task evaluations. However, if your intention is to train task-specific retrievers, the time required is much more manageable. In this case, you only need to score one dataset and then train the retriever on that single dataset, which takes significantly less time.
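For instance, a task-specific setup could look like the following sketch. It reuses the flags from the command above; the assumption that omitting --multi_task yields task-specific commands should be checked against the README:

TRAIN_CLUSTERS='sentiment'  # score and train on a single cluster
TEST_CLUSTERS='sentiment'
python get_cmds.py \
    --output_dir my_task_specific_data \
    --train_clusters ${TRAIN_CLUSTERS} \
    --test_clusters ${TEST_CLUSTERS} \
    --scr_model ${SCORE_LLM} \
    --inf_model ${INF_LLM} \
    --gpus 4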

Cheungki commented 3 months ago

Many thanks!