Questions about the "Pool=50K" in your paper.

hkust-nlp / deita

Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]

Apache License 2.0

Questions about the "Pool=50K" in your paper. #12

Status: Closed (DLiquor closed this issue 5 months ago)

DLiquor commented 7 months ago

Hi, thanks for your work! I have some questions about your experiment in training the complexity scorer.

"“Pool=50K” denotes the data selection procedure is conducted in a 50K-sized subset due to the cost of using ChatGPT to annotate the entire pool."

1. Is the data for "EVOL COMPLEXITY (Pool=50K)" selected from the 50K-sample subset, while the data for "EVOL COMPLEXITY" is selected from the original data pool?
2. How do you sample the 50K subset from the original data pool?

Looking forward to your reply!

VPeterV commented 6 months ago

Hi, thank you for your interest, and I apologize for the delayed response.

1. The data for "EVOL COMPLEXITY (Pool=50K)" consists of 50,000 samples randomly sampled from the original dataset. Due to the high cost of using ChatGPT, we had to sample a subset of the data to enable a fair comparison with the results obtained using the direct scoring method, which relies on ChatGPT.
2. We randomly sampled this data from the original dataset.
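
For readers who want to set up a comparable pool, drawing a fixed-size random subset might be sketched as below. The thread only says the subset was randomly sampled; uniform sampling without replacement, the seed value, the helper name `sample_pool`, and the toy records are all assumptions made for illustration, not the authors' actual script.

```python
import random


def sample_pool(pool, k=50_000, seed=0):
    """Draw k items uniformly at random without replacement (assumed scheme).

    `sample_pool` is a hypothetical helper, not part of the deita codebase.
    """
    if k >= len(pool):
        # Pool is already small enough: keep everything.
        return list(pool)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(pool, k)


# Toy stand-in for the original data pool (hypothetical records).
pool = [{"id": i, "instruction": f"example {i}"} for i in range(200_000)]

# The "Pool=50K" subset: every later scoring/selection step would see only this.
subset = sample_pool(pool, k=50_000, seed=0)
```

Fixing the seed keeps the subset identical across runs, so different scoring methods can be compared on exactly the same 50K candidates.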

Please do not hesitate to contact us if you have any further questions.

VPeterV commented 5 months ago

Closing this issue for now. If you have any additional questions or concerns, please don't hesitate to reopen it.