Closed: teetone closed this issue 2 years ago
Can you print the original reference and original span? I think that would be very helpful to understand the bug.
Sorry, how would that help? It looks like the only difference is the "There" in the beginning.
@yuhui-zh15 It's also reproducible with openai/text-babbage-001:
Error when running commonsense:model=full_functionality_text,dataset=hellaswag,method=multiple_choice_separate_calibrated:
Traceback (most recent call last):
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/presentation/present.py", line 98, in run
new_run_specs = run_benchmarking(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/run.py", line 60, in run_benchmarking
runner.run_all()
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 90, in run_all
self.run_one(run_spec)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 129, in run_one
metric_result: MetricResult = metric.evaluate(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 150, in evaluate
results: List[List[Stat]] = parallel_map(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/common/general.py", line 183, in parallel_map
results: List = list(tqdm(executor.map(process, items), total=len(items)))
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 88, in process
self.metric.evaluate_references(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 693, in evaluate_references
reference_stats[reference_key] = compute_logprob_and_length(request_state)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 673, in compute_logprob_and_length
assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: stepUsingavendorspecificbeanpeelercutthebroccoliintoinchwidestripsThisworksbestwhenyouusecucumbers, Actual: vendorspecificbeanpeelercutthebroccoliintobytesxc2bytesxbcinchwidestripsThisworksbestwhenyouusecucumbers
Hi, I don’t think we can understand the bug without printing the original input. Perhaps this model uses a weird tokenizer, so the token length != the real reference length.
I was able to reproduce with openai/text-babbage-001, which uses the GPT-2 tokenizer. Could you try running with openai/text-babbage-001?
Also, this doesn't look like a weird tokenization error to me. The only difference is the word "There".
I think you mentioned that the check was added for debugging purposes. What if we just remove the check? Would it still be correct?
I found another example with the AI21 models:
assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: alsosharesinformationonhowthingschangedlaterwhenshewasfinallyallowedtoparticipatefreely, Actual: Shealsosharesinformationonhowthingschangedlaterwhenshewasfinallyallowedtoparticipatefreely
It always seems to be missing the first word or token.
I proposed a fix in #820 but haven't verified its correctness.
Okay, now I can reproduce this bug with the following command and input file.
venv/bin/benchmark-run -r commonsense:model=openai/text-babbage-001,dataset=hellaswag,method=multiple_choice_separate_calibrated --suite 0903
hellaswag_val.jsonl
{"ind": 11289, "ctx": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "activity_label": "Food and Entertaining", "ctx_a": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "ctx_b": "", "split": "val", "split_type": "indomain", "label": 2, "endings": ["[step] Using a vendor-specific bean peeler, cut the broccoli into \u00bc inch-wide strips. This works best when you use cucumbers.", "[title] Wash each piece inside out then place them in a steamer bag. [step] Save the white part if you plan to steam your broccoli afterward.", "[step] Cutting the broccoli into small pieces will help it to cook faster. [substeps] If you want to eat the stalks, they should be cut into pieces that are slightly smaller than the florets.", "[step] Rinse, drain the water, and cut once done. [title] Place your cooked broccoli into a steamer basket/pot."], "source_id": "wikihow~18617"}
And I now understand what happened by printing the original input:
AssertionError:
Expected: [step] Using a vendor-specific bean peeler, cut the broccoli into ¼ inch-wide strips. This works best when you use cucumbers.
Actual: vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.
So the ¼ (1 char) is expanded into multiple characters by the tokenizer. Therefore, counting the span length in characters leads to the error.
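For reference, here is a minimal sketch of the expansion (assuming the Hugging Face GPT-2 tokenizer, which these OpenAI models reportedly share; the exact token splits may differ):

# Minimal sketch: GPT-2's byte-level BPE maps the UTF-8 bytes of ¼
# (\xc2 \xbc) to printable placeholder characters, so the surface forms
# of the tokens are longer than the original single character.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
text = "cut the broccoli into \u00bc inch-wide strips"
tokens = tok.tokenize(text)
print(tokens)                           # byte-level pieces with Ġ/Â markers
print(len(text), len("".join(tokens)))  # the character counts no longer match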
Pull request #820 cannot fix this problem either, because the loop will never reach its break condition.
Do you have any suggestions for this? @percyliang @teetone
It seems the only complete solution is to directly get the tokenized choices and compute len(tokenized_choice). Or another possible solution is to filter out all non-ASCII input chars?
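A hedged sketch of the first idea (the helper and its names are hypothetical, and it assumes the reference can be re-tokenized with the same GPT-2 tokenizer the model used):

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

def select_span_by_tokens(completion_tokens, reference):
    # Count tokens instead of characters; the echoed reference is usually
    # preceded by a space, hence the " " prefix before tokenizing.
    num_reference_tokens = len(tok.tokenize(" " + reference))
    return completion_tokens[-num_reference_tokens:]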
The together/gpt-j-6b case seems much more complex, and the bug seems to come from other parts.
Error:
AssertionError:
Expected: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune's mass.
----> 245 chars
Actual: ĠareĠtwoĠbrandsĠofĠhydrometersĠavailableĠonĠtheĠmarket,ĠandĠtheĠonesĠapprovedĠbyĠinternationalĠstandardsĠareĠ:Ġ[substeps]ĠDiichydraseĠ(blueĠorĠgreyĠwithoutĠtheĠlabelĠ"ĠsodiumĠhydrometerĠ").ĠTheĠdiichydraseĠmeterĠmeasuresĠtheĠneptune'sĠmass.
----> 240 chars
Why is the actual length < the expected length? The loop should only exit once the actual length >= the expected length. (Note the actual string is exactly 5 characters shorter, which matches the missing word "There".)
It seems the request is wrong. Here is the raw output from print(request_state):
RequestState(instance=Instance(input='Education and Communications: [header] How to calibrate a hydrometer [title] Identify the parts of the hydrometer. [step] A hydrometer is a glass device that has a bulbous, weighted end designed to float in a liquid and a narrow, long stem with a graduated scale on the other end. It is used to measure the specific gravity of a liquid. ', references=[Reference(output='[substeps] You can get a hydrometer that is labeled for gas both from the fermentation process, and from the reading of light and air. The rated gas for fuel is 44.99 °.', tags=[]), Reference(output='Specific gravity is the density of a liquid compared to water. [substeps] The bulbous end is placed into the liquid in question while the narrow stem will stick out of the liquid.', tags=['correct']), Reference(output='[substeps] Hydrometers are often made of silver or stainless steel. When finished, the metal is much more solid, and will generally be made of real metal.', tags=[]), Reference(output='There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', tags=[])], split='valid', sub_split=None, id='id39905', perturbation=None, contrast_inputs=None, contrast_references=None), reference_index=3, request_mode='calibration', train_trial_index=0, output_mapping=None, request=Request(model='together/gpt-j-6b', prompt='Answer: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', temperature=0, num_completions=1, top_k_per_token=1, max_tokens=0, stop_sequences=[], echo_prompt=True, top_p=1, presence_penalty=0, frequency_penalty=0, random=None), result=RequestResult(success=True, completions=[Sequence(text=' are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). 
The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[Token(text='Ġare', logprob=0, top_logprobs={}), Token(text='Ġtwo', logprob=-5.34375, top_logprobs={'Ġthe': -3.0546875}), Token(text='Ġbrands', logprob=-8.9453125, top_logprobs={'Ġways': -3.068359375}), Token(text='Ġof', logprob=-0.88818359375, top_logprobs={'Ġof': -0.88818359375}), Token(text='Ġhyd', logprob=-8.3515625, top_logprobs={'Ġthe': -3.28125}), Token(text='rom', logprob=-2.279296875, top_logprobs={'roc': -1.3095703125}), Token(text='eters', logprob=-3.111328125, top_logprobs={'or': -0.2205810546875}), Token(text='Ġavailable', logprob=-3.212890625, top_logprobs={',': -1.759765625}), Token(text='Ġon', logprob=-2.00390625, top_logprobs={'.': -1.9560546875}), Token(text='Ġthe', logprob=-0.128173828125, top_logprobs={'Ġthe': -0.128173828125}), Token(text='Ġmarket', logprob=-0.0716552734375, top_logprobs={'Ġmarket': -0.0716552734375}), Token(text=',', logprob=-1.689453125, top_logprobs={'.': -1.158203125}), Token(text='Ġand', logprob=-2.583984375, top_logprobs={'Ġthe': -1.5615234375}), Token(text='Ġthe', logprob=-2.484375, top_logprobs={'Ġthey': -1.6640625}), Token(text='Ġones', logprob=-5.0078125, top_logprobs={'Ġone': -2.638671875}), Token(text='Ġapproved', logprob=-8.640625, top_logprobs={'ĠI': -1.529296875}), Token(text='Ġby', logprob=-0.2919921875, top_logprobs={'Ġby': -0.2919921875}), Token(text='Ġinternational', logprob=-6.9140625, top_logprobs={'Ġthe': -0.467529296875}), Token(text='Ġstandards', logprob=-1.296875, top_logprobs={'Ġstandards': -1.296875}), Token(text='Ġare', logprob=-0.82080078125, top_logprobs={'Ġare': -0.82080078125}), Token(text='Ġ:', logprob=-7.609375, top_logprobs={'Ġthe': -2.056640625}), Token(text='Ġ[', logprob=-8.3984375, top_logprobs={'Ċ': -0.80908203125}), Token(text='sub', logprob=-9.125, top_logprobs={'Table': -2.12109375}), Token(text='steps', logprob=-10.2421875, top_logprobs={'scription': -1.279296875}), Token(text=']', logprob=-1.048828125, top_logprobs={']': -1.048828125}), Token(text='ĠDi', logprob=-9.375, top_logprobs={'Ċ': -1.5693359375}), Token(text='ich', logprob=-12.6953125, top_logprobs={'ast': -0.93896484375}), Token(text='yd', logprob=-6.3828125, top_logprobs={'rom': -0.5234375}), Token(text='rase', logprob=-11.578125, top_logprobs={'rom': -0.47119140625}), Token(text='Ġ(', logprob=-3.05859375, top_logprobs={',': -2.00390625}), Token(text='blue', logprob=-8.875, top_logprobs={'D': -2.4921875}), Token(text='Ġor', logprob=-4.63671875, top_logprobs={')': -1.1455078125}), Token(text='Ġgrey', logprob=-3.5703125, top_logprobs={'Ġgreen': -1.8046875}), Token(text='Ġwithout', logprob=-9.0546875, top_logprobs={')': -1.056640625}), Token(text='Ġthe', logprob=-2.48046875, top_logprobs={'Ġa': -2.08203125}), Token(text='Ġlabel', logprob=-4.19921875, top_logprobs={'Ġletter': -3.123046875}), Token(text='Ġ"', logprob=-3.943359375, top_logprobs={')': -1.021484375}), Token(text='Ġsodium', logprob=-12.09375, top_logprobs={'Di': -3.234375}), Token(text='Ġhyd', logprob=-2.13671875, top_logprobs={'Ġchloride': -1.9267578125}), Token(text='rom', logprob=-3.501953125, top_logprobs={'rox': -0.0810546875}), Token(text='eter', logprob=-0.2010498046875, top_logprobs={'eter': -0.2010498046875}), Token(text='Ġ"', logprob=-1.9638671875, top_logprobs={'"': -1.4326171875}), Token(text=').', logprob=-3.5078125, top_logprobs={')': -1.484375}), Token(text='ĠThe', logprob=-2.291015625, top_logprobs={'Ġ[': -2.158203125}), Token(text='Ġdi', logprob=-5.1015625, top_logprobs={'Ġother': 
-2.2890625}), Token(text='ich', logprob=-1.560546875, top_logprobs={'hyd': -0.654296875}), Token(text='yd', logprob=-0.00273895263671875, top_logprobs={'yd': -0.00273895263671875}), Token(text='rase', logprob=-0.1517333984375, top_logprobs={'rase': -0.1517333984375}), Token(text='Ġmeter', logprob=-7.93359375, top_logprobs={'Ġis': -1.326171875}), Token(text='Ġmeasures', logprob=-4.2578125, top_logprobs={'Ġis': -0.80615234375}), Token(text='Ġthe', logprob=-0.48291015625, top_logprobs={'Ġthe': -0.48291015625}), Token(text='Ġne', logprob=-14.765625, top_logprobs={'Ġdensity': -1.13671875}), Token(text='pt', logprob=-2.80859375, top_logprobs={'ph': -1.26953125}), Token(text='une', logprob=-4.0703125, top_logprobs={'un': -0.236328125}), Token(text="'s", logprob=-2.791015625, top_logprobs={'Ġof': -2.775390625}), Token(text='Ġmass', logprob=-6.21875, top_logprobs={'Ġspecific': -0.98486328125}), Token(text='.', logprob=-2.931640625, top_logprobs={'Ġin': -1.369140625})], finish_reason={'reason': 'length'})], cached=True, request_time=0, request_datetime=None, error=None, batch_size=245, batch_request_time=4.580406188964844), num_in_context_examples=0, input_truncated=False, num_conditioning_tokens=0)
Note the part highlighted with *:
RequestState(...,
request=Request(
model='together/gpt-j-6b',
prompt=*'Answer: There are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.',
temperature=0,
num_completions=1,
top_k_per_token=1,
max_tokens=0,
stop_sequences=[],
echo_prompt=True,
top_p=1,
presence_penalty=0,
frequency_penalty=0,
random=None
),
result=RequestResult(
success=True,
completions=[Sequence(text=*' are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[...]]
...
)
I'm not sure why result.completions from GPT-J is missing several tokens (Answer: There) while these tokens clearly exist in request.prompt.
It's also missing "Using a" in the beginning, right? Do you know why that is? I think the main problem is the couple of missing tokens at the beginning.
@LorrinWWW Do you have any insight into this? This happens when echo_prompt=True and max_tokens=0 for together/gpt-j-6b.
I looked into this for Bloom and OPT. It could be a truncation/padding issue in an earlier version (on our side); the current version does not have this issue. (I guess it should be the same case for GPT-J.) @teetone Could you send me all requests where result.completions does not match request.prompt? I will rerun and check them before sending them back to you. Thanks!
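Something along these lines might work for collecting them (a hypothetical helper, not part of the codebase; the field names are taken from the RequestState dump above):

def find_echo_mismatches(request_states):
    # With echo_prompt=True and max_tokens=0, the completion text should
    # reproduce the prompt verbatim, so anything else is a suspect request.
    mismatched = []
    for rs in request_states:
        if not (rs.request.echo_prompt and rs.request.max_tokens == 0):
            continue
        if any(c.text != rs.request.prompt for c in rs.result.completions):
            mismatched.append(rs)
    return mismatched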
To be safe, could we regenerate results for queries with echo_prompt=True for those models? It shouldn't be too many.
Sure
Okay, now I can reproduce this bug by the following command and input files.
venv/bin/benchmark-run -r commonsense:model=openai/text-babbage-001,dataset=hellaswag,method=multiple_choice_separate_calibrated --suite 0903
hellaswag_val.jsonl {"ind": 11289, "ctx": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "activity_label": "Food and Entertaining", "ctx_a": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "ctx_b": "", "split": "val", "split_type": "indomain", "label": 2, "endings": ["[step] Using a vendor-specific bean peeler, cut the broccoli into \u00bc inch-wide strips. This works best when you use cucumbers.", "[title] Wash each piece inside out then place them in a steamer bag. [step] Save the white part if you plan to steam your broccoli afterward.", "[step] Cutting the broccoli into small pieces will help it to cook faster. [substeps] If you want to eat the stalks, they should be cut into pieces that are slightly smaller than the florets.", "[step] Rinse, drain the water, and cut once done. [title] Place your cooked broccoli into a steamer basket/pot."], "source_id": "wikihow~18617"}
And I now understand what happened by printing the original input:
AssertionError: Expected: [step] Using a vendor-specific bean peeler, cut the broccoli into ¼ inch-wide strips. This works best when you use cucumbers. Actual: vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.
So the ¼ (1 char) is expanded to multiple characters by the tokenizer, and counting the span length in characters therefore leads to this error. Pull request #820 cannot fix this problem either, because the loop's break condition is never reached. Do you have any suggestions? @percyliang @teetone It seems the only complete solution is to directly get the tokenized choices and compute len(tokenized_choice) (sketched below). Or another possible solution is to filter out all non-ASCII input characters?
It's also missing Using a at the beginning, right? Do you know why that is? I think the main problem is the couple of missing tokens at the beginning.
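For illustration, here is a minimal sketch of the len(tokenized_choice) idea, using the Hugging Face GPT-2 tokenizer as a stand-in (an assumption; in practice this would go through the benchmark's own tokenizer). One caveat: tokenizing the reference on its own is not guaranteed to match how it was tokenized inside the full prompt, although it typically does when the reference starts at a whitespace boundary.

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

reference = (
    "[step] Using a vendor-specific bean peeler, cut the broccoli into "
    "\u00bc inch-wide strips. This works best when you use cucumbers."
)

# Count the reference in tokens rather than characters, so a multi-byte
# character like ¼ (1 char, but 2 byte-level BPE tokens) cannot skew the count.
num_reference_tokens = len(tokenizer.encode(reference))

# The span is then simply the last num_reference_tokens tokens of the
# echoed completion (completion_tokens is hypothetical here):
# span_tokens = completion_tokens[-num_reference_tokens:]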
Yes, this is because we select the span by counting the character length of the last tokens.
For example, if len(reference) = 125, we keep prepending tokens from the completion until len(span) >= 125:
# Pseudo-code for span selection: walk the completion tokens from the end,
# prepending until the accumulated characters cover the reference.
span_tokens = []
for token in completions[::-1]:
    if len(''.join(span_tokens)) >= len(reference):
        break
    span_tokens.insert(0, token)  # Python lists have no prepend()
Because the tokenizer now expands ¼ (1 char) to bytes: \xc2bytes:\xbc (21 chars), the loop ends earlier, and the first several tokens go missing.
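To make the failure concrete, here is a small self-contained reproduction of the character-counting loop on these tokens. The token strings are copied from the dump below; the few tokens before " vendor" are an illustrative guess at the tokenization, not from the dump.

# Demo of the span-selection bug: the two byte tokens for ¼ contribute
# 21 characters of "bytes:..." text, while ¼ counts as 1 character in the
# reference, so the loop stops too early.
completion_tail = [
    " [", "step", "]", " Using", " a",  # the tokens the span should keep
    " vendor", "-", "specific", " bean", " peel", "er", ",", " cut",
    " the", " broccoli", " into", "bytes: \\xc2", "bytes:\\xbc",
    " inch", "-", "wide", " strips", ".", " This", " works", " best",
    " when", " you", " use", " cuc", "umbers", ".",
]
reference = (
    "[step] Using a vendor-specific bean peeler, cut the broccoli into "
    "\u00bc inch-wide strips. This works best when you use cucumbers."
)  # 125 characters

span_tokens = []
for token in reversed(completion_tail):
    if len("".join(span_tokens)) >= len(reference):
        break  # fires at 130 >= 125, before reaching " [step] Using a"
    span_tokens.insert(0, token)

print("".join(span_tokens))  # starts at " vendor": the leading tokens are lost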
Here is the output from print(token, len(token)) during the for token in completions[::-1] loop:
Completions:
. 1
umbers 6
cuc 4
use 4
you 4
when 5
best 5
works 6
This 5
. 1
strips 7
wide 4
- 1
inch 5
bytes:\xbc 10
bytes: \xc2 11
into 5
broccoli 9
the 4
cut 4
, 1
er 2
peel 5
bean 5
specific 8
- 1
vendor 7
---> length = 130
Reference:
Food and Entertaining: [header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired. [step] Using a vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.
---> length = 125
Therefore, this is a bug induced by the tokenizer: tokenization should not change the character length of the input. It could be fixed with a detokenizer, but we don't have that component in the current code base.
We have decode on WindowService and WindowServiceFactory. There are examples in the code of how they are used.
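For a concrete picture of what a decode-based fix does, here is a minimal sketch using the Hugging Face GPT-2 tokenizer as a stand-in for the code base's WindowService (the stand-in is an assumption; the actual fix presumably goes through the decode mentioned above): select the span in token space, then decode it back to text before the character-level comparison.

from transformers import GPT2TokenizerFast

# Stand-in for a WindowService-style decode (assumption): with a real decode
# step, byte tokens such as "bytes: \xc2" are reassembled into the original
# character, so the decoded span matches the reference exactly.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

prompt = "Cut the broccoli into \u00bc inch-wide strips."
reference = " into \u00bc inch-wide strips."

prompt_ids = tokenizer.encode(prompt)
num_reference_tokens = len(tokenizer.encode(reference))

# Select the span by token count, then decode for the text comparison.
span_ids = prompt_ids[-num_reference_tokens:]
assert tokenizer.decode(span_ids) == reference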
Thanks for pointing this out! #826 should be a perfect fix.
model=together_gpt-j-6b
Error when running commonsense:model=full_functionality_text,dataset=hellaswag,method=multiple_choice_separate_calibrated:
Traceback (most recent call last):
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/presentation/present.py", line 98, in run
new_run_specs = run_benchmarking(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/run.py", line 60, in run_benchmarking
runner.run_all()
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 90, in run_all
self.run_one(run_spec)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 129, in run_one
metric_result: MetricResult = metric.evaluate(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 150, in evaluate
results: List[List[Stat]] = parallel_map(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/common/general.py", line 183, in parallel_map
results: List = list(tqdm(executor.map(process, items), total=len(items)))
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
yield fs.pop().result()
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 88, in process
self.metric.evaluate_references(
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 693, in evaluate_references
reference_stats[reference_key] = compute_logprob_and_length(request_state)
File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 673, in compute_logprob_and_length
assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: TherearetwobrandsofhydrometersavailableonthemarketandtheonesapprovedbyinternationalstandardsaresubstepsDiichydraseblueorgreywithoutthelabelsodiumhydrometerThediichydrasemetermeasurestheneptunesmass, Actual: aretwobrandsofhydrometersavailableonthemarketandtheonesapprovedbyinternationalstandardsaresubstepsDiichydraseblueorgreywithoutthelabelsodiumhydrometerThediichydrasemetermeasurestheneptunesmass