kaistAI / LangBridge

[ACL 2024] LangBridge: Multilingual Reasoning Without Multilingual Supervision
https://aclanthology.org/2024.acl-long.405/

Cannot Reproduce the Experimental Results #11

Closed Kosei1227 closed 2 months ago

Kosei1227 commented 2 months ago

Hi!! Thank you for the excellent paper and wonderful results.

As researchers working on low-resource languages, we would like to reproduce the experimental results and apply/improve the LangBridge approach for our target languages.

We ran the following command:

python eval_langbridge.py \
  --checkpoint_path kaist-ai/metamath-langbridge-9b \
  --enc_tokenizer kaist-ai/langbridge_encoder_tokenizer \
  --tasks mgsm_en,mgsm_es,mgsm_fr,mgsm_de,mgsm_ru,mgsm_zh,mgsm_ja,mgsm_th,mgsm_sw,mgsm_bn,mgsm_te \
  --instruction_template metamath \
  --batch_size 1 \
  --output_path eval_outputs/mgsm/metamath-langbridge_9b \
  --device cuda:2 \
  --no_cache

And we got this output:

kaist-ai/metamath-langbridge-9b (), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1

Task     Version  Metric  Value    Stderr
mgsm_bn 0 acc 0.040 ± 0.0124
mgsm_de 0 acc 0.108 ± 0.0197
mgsm_en 0 acc 0.152 ± 0.0228
mgsm_es 0 acc 0.084 ± 0.0176
mgsm_fr 0 acc 0.096 ± 0.0187
mgsm_ja 0 acc 0.060 ± 0.0151
mgsm_ru 0 acc 0.068 ± 0.0160
mgsm_sw 0 acc 0.024 ± 0.0097
mgsm_te 0 acc 0.036 ± 0.0118
mgsm_th 0 acc 0.076 ± 0.0168
mgsm_zh 0 acc 0.048 ± 0.0135

These values are considerably lower than those reported in the paper.

Have you encountered this issue before? Could you share the exact script used to produce the paper's results?

Thank you

Kosei1227 commented 2 months ago

Thank you for sharing this. Chrome and Outlook flagged the downloaded files/links as containing viruses and blocked them. Could you share clean download links?

MattYoon commented 2 months ago

Hi @Kosei1227, thank you for reporting.

Unfortunately, I was not able to reproduce your issue.

For the sake of time I only ran English, using the following script:

python eval_langbridge.py \
  --checkpoint_path kaist-ai/metamath-langbridge-9b \
  --enc_tokenizer kaist-ai/langbridge_encoder_tokenizer \
  --tasks mgsm_en \
  --instruction_template metamath \
  --batch_size 1 \
  --output_path eval_outputs/mgsm/metamath-langbridge_9b \
  --device cuda:0 \
  --no_cache

The result is:

Task Version Metric Value Stderr
mgsm_en 0 acc 0.62 ± 0.0308

MattYoon commented 2 months ago

Did you use fp16 precision by any chance?

Since mT5 and LangBridge were both trained in bf16 precision, running inference in fp16 may produce odd behavior. You need to use either bf16 or fp32.
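To illustrate the failure mode: fp16 and bf16 are both 16-bit formats, but fp16's largest finite value is about 65504, while bf16 keeps float32's 8-bit exponent and therefore covers float32's full range (at lower mantissa precision). Activations inside a bf16-trained model can legitimately exceed fp16's range and overflow. A stdlib-only sketch of the range difference (illustrative only, not part of the LangBridge codebase):

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip through IEEE 754 half precision ('e' struct format).
    # Python's struct raises OverflowError for finite values beyond
    # fp16's max (~65504), where fp16 hardware would overflow to inf.
    return struct.unpack('e', struct.pack('e', x))[0]

def to_bf16(x: float) -> float:
    # bfloat16 keeps float32's 8-bit exponent and truncates the mantissa
    # to 7 bits: zero out the low 16 bits of the float32 bit pattern.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

x = 1.0e5  # an activation magnitude a bf16-trained model can produce
print(to_bf16(x))   # representable in bf16, with a small rounding error
try:
    to_fp16(x)
except OverflowError:
    print("fp16 overflow")  # the value simply does not fit in fp16
```

Real fp16 hardware saturates to inf rather than raising, but the point is the same: values the model was trained to produce in bf16 fall outside fp16's representable range, which is why inference must use bf16 or fp32.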

> Chrome and Outlook flagged the downloaded files/links as containing viruses and blocked them. Could you share clean download links?

Not sure what you mean by this?

ayushayush591 commented 2 months ago

@Kosei1227 Try changing the transformers version to the one specified in requirements.txt; that should fix the issue.
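For anyone hitting the same mismatch, a quick sanity check can confirm whether the installed transformers release matches the repo's pin. The helper below is a generic, hypothetical utility (the `matches_pin` name and the placeholder version string are illustrative, not from the LangBridge repo); check requirements.txt for the actual pinned version:

```python
from importlib.metadata import PackageNotFoundError, version

def matches_pin(package: str, pinned: str) -> bool:
    """Return True only if `package` is installed at exactly version `pinned`."""
    try:
        return version(package) == pinned
    except PackageNotFoundError:
        # Package is not installed at all.
        return False

# Usage (placeholder version string, not the actual pin from requirements.txt):
# matches_pin("transformers", "4.x.y")
```

If the check fails, reinstalling with `pip install -r requirements.txt` restores the versions the repo was tested against.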

Kosei1227 commented 2 months ago

Thank you so much!! Changing the transformers version to match requirements.txt worked!

Here are the results I got, for future reference.

kaist-ai/metamath-langbridge-15b (), limit: None, provide_description: False, num_fewshot: 0, batch_size: 1

Task     Version  Metric  Value    Stderr
mgsm_bn 0 acc 0.416 ± 0.0312
mgsm_de 0 acc 0.620 ± 0.0308
mgsm_en 0 acc 0.684 ± 0.0295
mgsm_es 0 acc 0.640 ± 0.0304
mgsm_fr 0 acc 0.612 ± 0.0309
mgsm_ja 0 acc 0.408 ± 0.0311
mgsm_ru 0 acc 0.612 ± 0.0309
mgsm_sw 0 acc 0.504 ± 0.0317
mgsm_te 0 acc 0.344 ± 0.0301
mgsm_th 0 acc 0.508 ± 0.0317
mgsm_zh 0 acc 0.480 ± 0.0317