lz1oceani / verify_cot

Creative Commons Attribution 4.0 International

Standard of invalid reasoning. #3

Open AegeanYan opened 11 months ago

AegeanYan commented 11 months ago

Could you please tell me how you identify invalid reasoning processes? I found some rationales in gsm8k_100 that look valid, but their flag is 0. For example:

question: Hannah needs to drink 60 ml of water for each kilometer she runs. If her gym teacher tells her to run 8 laps and each lap is 0.25 km, how many milliliters of water will Hannah need to drink?

answer:

#1. Hannah needs to drink 60 ml of water for each kilometer she runs.

#2. Each lap is 0.25 km.

#3. Hannah needs to run 8 laps.

#4. How many milliliters of water will Hannah need to drink?

#5. (by #2) Step 1: Calculate the distance of 8 laps.

Distance of each lap: 0.25 km
Total laps: 8
Total distance: 0.25 km * 8 = 2 km

#6. (by #1) Step 2: Calculate the amount of water Hannah needs to drink for 2 km.

Water needed for 1 km: 60 ml
Total distance: 2 km
Water needed for 2 km: 60 ml * 2 = 120 ml

#7. (by #4 #6) The original question is #4. How many milliliters of water will Hannah need to drink? We do not miss information on the rewritten labels. So the answer to this question is Hannah will need to drink 120 ml of water.

gt_answer: First find the total number of kilometers Hannah runs: 8 laps * .25 km/lap = <<8*.25=2>>2 km
Then multiply the number of kilometers she runs by the amount of water she needs per kilometer to find the total amount of water she needs: 60 ml/km * 2 km = <<60*2=120>>120 ml

120
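As a quick sanity check on the arithmetic in both the rationale and the ground-truth answer, the two steps can be reproduced directly. This is only a sketch; the variable names are illustrative and are not taken from the dataset or the repository.

```python
# Values taken from the question text.
ml_per_km = 60      # water needed per kilometer
laps = 8            # laps assigned by the gym teacher
km_per_lap = 0.25   # length of one lap

total_km = laps * km_per_lap      # 8 * 0.25 = 2.0 km
water_ml = ml_per_km * total_km   # 60 * 2.0 = 120.0 ml

print(total_km, water_ml)
```

Both the rationale and gt_answer arrive at 120 ml, so the final answer itself is not what the flag of 0 is marking.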

lz1oceani commented 11 months ago

It contains reference errors. For example, #5 should refer to #2 and #3, and #6 should refer to #1 and #5.
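To make the reference errors concrete, here is a minimal sketch that compares the premises each step cites in its "(by ...)" clause against the premises it actually depends on. The `required` mapping is written by hand from the comment above, and the helper names `cited_refs` and `missing_refs` are hypothetical; this is not how the repository computes its labels.

```python
import re

# Steps #5 and #6 as they appear in the quoted answer.
steps = {
    5: "(by #2) Step 1: Calculate the distance of 8 laps.",
    6: "(by #1) Step 2: Calculate the amount of water Hannah needs to drink for 2 km.",
}

# Premises each step actually uses, per the comment above.
# Hand-written assumption for illustration, not output of the repository.
required = {5: {2, 3}, 6: {1, 5}}


def cited_refs(step_text):
    """Return the set of #N labels inside the leading "(by ...)" clause."""
    match = re.match(r"\(by ([^)]*)\)", step_text)
    if match is None:
        return set()
    return {int(n) for n in re.findall(r"#(\d+)", match.group(1))}


def missing_refs(steps, required):
    """Map each step to the required premises it fails to cite."""
    result = {}
    for number, text in steps.items():
        missing = required[number] - cited_refs(text)
        if missing:
            result[number] = sorted(missing)
    return result


print(missing_refs(steps, required))  # {5: [3], 6: [5]}
```

Step #5 uses the lap count from #3 without citing it, and step #6 uses the 2 km result from #5 without citing it, which matches the two errors named above.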

AegeanYan commented 11 months ago

So almost all of the 15 false-positive cases are reference errors?

lz1oceani commented 11 months ago

In most cases, reference errors are the main cause of false positives.

AegeanYan commented 11 months ago

I looked at the first few cases and the last case among the 15 gsm8k cases where flag = 0 and the answer is correct, and they all seem to be reference errors. Could you add a label for this in your data? I think it is confusing to mix these two types of problems in your experiment.
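One way such a label could be used, sketched under loud assumptions: the records below are made-up toy data, and the field names `flag`, `answer_correct`, and `error_type` are hypothetical rather than the dataset's actual schema. The idea is simply to select the cases with flag = 0 and a correct final answer, then count them by error type.

```python
from collections import Counter

# Toy records for illustration only; NOT taken from gsm8k_100.
# Field names are hypothetical, not the repository's schema.
records = [
    {"id": 0, "flag": 0, "answer_correct": True, "error_type": "reference"},
    {"id": 1, "flag": 0, "answer_correct": True, "error_type": "reference"},
    {"id": 2, "flag": 0, "answer_correct": False, "error_type": "reasoning"},
    {"id": 3, "flag": 1, "answer_correct": True, "error_type": None},
]


def count_error_types(records):
    """Count error types among cases flagged invalid despite a correct answer."""
    selected = [r for r in records if r["flag"] == 0 and r["answer_correct"]]
    return Counter(r["error_type"] for r in selected)


print(count_error_types(records))
```

With a label like this, reference errors and genuine reasoning errors could be reported separately instead of being mixed under a single flag.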