stanford-crfm / helm

Holistic Evaluation of Language Models (HELM), a framework to increase the transparency of language models (https://arxiv.org/abs/2211.09110). This framework is also used to evaluate text-to-image models in HEIM (https://arxiv.org/abs/2311.04287) and vision-language models in VHELM (https://arxiv.org/abs/2410.07112).
https://crfm.stanford.edu/helm
Apache License 2.0

Commonsense failing even with weakened check #816

Closed teetone closed 2 years ago

teetone commented 2 years ago

model=together_gpt-j-6b

Error when running commonsense:model=full_functionality_text,dataset=hellaswag,method=multiple_choice_separate_calibrated:
Traceback (most recent call last):
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/presentation/present.py", line 98, in run
    new_run_specs = run_benchmarking(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/run.py", line 60, in run_benchmarking
    runner.run_all()
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 90, in run_all
    self.run_one(run_spec)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 129, in run_one
    metric_result: MetricResult = metric.evaluate(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 150, in evaluate
    results: List[List[Stat]] = parallel_map(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/common/general.py", line 183, in parallel_map
    results: List = list(tqdm(executor.map(process, items), total=len(items)))
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 88, in process
    self.metric.evaluate_references(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 693, in evaluate_references
    reference_stats[reference_key] = compute_logprob_and_length(request_state)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 673, in compute_logprob_and_length
    assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: TherearetwobrandsofhydrometersavailableonthemarketandtheonesapprovedbyinternationalstandardsaresubstepsDiichydraseblueorgreywithoutthelabelsodiumhydrometerThediichydrasemetermeasurestheneptunesmass, Actual: aretwobrandsofhydrometersavailableonthemarketandtheonesapprovedbyinternationalstandardsaresubstepsDiichydraseblueorgreywithoutthelabelsodiumhydrometerThediichydrasemetermeasurestheneptunesmass
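
(For readers decoding the squashed Expected/Actual strings: judging from the output, the assertion compares strings with whitespace and punctuation stripped. A minimal sketch of that kind of filtering, assumed from the output rather than taken from HELM's actual code:

import re

def filter_chars(s):
    # Drop everything except letters and digits, as the squashed strings suggest.
    return re.sub(r"[^A-Za-z0-9]", "", s)

print(filter_chars("There are two brands of hydrometers..."))
# "Therearetwobrandsofhydrometers..."
)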

yuhui-zh15 commented 2 years ago

Can you print the original reference and original span? I think that would be very helpful to understand the bug.

teetone commented 2 years ago

> Can you print the original reference and original span? I think that would be very helpful to understand the bug.

Sorry, how would that help? It looks like the only difference is the "There" in the beginning.

teetone commented 2 years ago

@yuhui-zh15 It's also reproducible with openai/text-babbage-001:

  Error when running commonsense:model=full_functionality_text,dataset=hellaswag,method=multiple_choice_separate_calibrated:
Traceback (most recent call last):
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/presentation/present.py", line 98, in run
    new_run_specs = run_benchmarking(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/run.py", line 60, in run_benchmarking
    runner.run_all()
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 90, in run_all
    self.run_one(run_spec)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/runner.py", line 129, in run_one
    metric_result: MetricResult = metric.evaluate(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 150, in evaluate
    results: List[List[Stat]] = parallel_map(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/common/general.py", line 183, in parallel_map
    results: List = list(tqdm(executor.map(process, items), total=len(items)))
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 619, in result_iterator
    yield fs.pop().result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/u/nlp/anaconda/main/anaconda3/envs/crfm_benchmarking/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/metric.py", line 88, in process
    self.metric.evaluate_references(
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 693, in evaluate_references
    reference_stats[reference_key] = compute_logprob_and_length(request_state)
  File "/juice/scr/nlp/crfm/benchmarking/benchmarking/src/benchmark/metrics/basic_metrics.py", line 673, in compute_logprob_and_length
    assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: stepUsingavendorspecificbeanpeelercutthebroccoliintoinchwidestripsThisworksbestwhenyouusecucumbers, Actual: vendorspecificbeanpeelercutthebroccoliintobytesxc2bytesxbcinchwidestripsThisworksbestwhenyouusecucumbers

yuhui-zh15 commented 2 years ago

Hi, I don’t think we can understand the bug without printing the original input. Perhaps this model uses a weird tokenizer, so the token length != the real reference length.

teetone commented 2 years ago

> Hi, I don’t think we can understand the bug without printing the original input. Perhaps this model uses a weird tokenizer, so the token length != the real reference length.

I was able to reproduce with openai/text-babbage-001, which uses the GPT-2 tokenizer. Could you try running with openai/text-babbage-001?

teetone commented 2 years ago

> > Can you print the original reference and original span? I think that would be very helpful to understand the bug.
>
> Sorry, how would that help? It looks like the only difference is the "There" in the beginning.

Also, this doesn't look like a weird tokenization error to me. The only difference is the word "There".

teetone commented 2 years ago

> Hi, I don’t think we can understand the bug without printing the original input. Perhaps this model uses a weird tokenizer, so the token length != the real reference length.

I think you mentioned that the check was added for debugging purposes. What if we just remove the check? Would it still be correct?

teetone commented 2 years ago

I found another example with the AI21 models:

    assert filtered_span == filtered_reference, f"Expected: {filtered_reference}, Actual: {filtered_span}"
AssertionError: Expected: alsosharesinformationonhowthingschangedlaterwhenshewasfinallyallowedtoparticipatefreely, Actual: Shealsosharesinformationonhowthingschangedlaterwhenshewasfinallyallowedtoparticipatefreely

It always seems to be missing the first word or token.

yuhui-zh15 commented 2 years ago

I proposed a fix in #820 but haven't verified its correctness.

yuhui-zh15 commented 2 years ago

Okay, now I can reproduce this bug with the following command and input file.

venv/bin/benchmark-run -r commonsense:model=openai/text-babbage-001,dataset=hellaswag,method=multiple_choice_separate_calibrated --suite 0903
hellaswag_val.jsonl
{"ind": 11289, "ctx": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "activity_label": "Food and Entertaining", "ctx_a": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "ctx_b": "", "split": "val", "split_type": "indomain", "label": 2, "endings": ["[step] Using a vendor-specific bean peeler, cut the broccoli into \u00bc inch-wide strips. This works best when you use cucumbers.", "[title] Wash each piece inside out then place them in a steamer bag. [step] Save the white part if you plan to steam your broccoli afterward.", "[step] Cutting the broccoli into small pieces will help it to cook faster. [substeps] If you want to eat the stalks, they should be cut into pieces that are slightly smaller than the florets.", "[step] Rinse, drain the water, and cut once done. [title] Place your cooked broccoli into a steamer basket/pot."], "source_id": "wikihow~18617"}

And I now understand what happened by printing the original input:

AssertionError: 
Expected: [step] Using a vendor-specific bean peeler, cut the broccoli into ¼ inch-wide strips. This works best when you use cucumbers.
Actual: vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.

So ¼ (1 character) is expanded into multiple characters by the tokenizer. Therefore, counting the span length in characters leads to the error.
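
For illustration, here is a minimal sketch of that expansion, assuming the transformers library and the GPT-2 tokenizer (which text-babbage-001 also uses, as noted above). The single character ¼ is two UTF-8 bytes (\xc2 \xbc), and the byte-level BPE maps each byte to its own character, so character counts diverge:

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text = "cut the broccoli into ¼ inch-wide strips"
tokens = tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"])
print(tokens)                       # '¼' surfaces as the two byte-level chars 'Â¼'
print(len(text))                    # 40 characters
print(sum(len(t) for t in tokens))  # > 40: 'Â' adds a character the raw text doesn't have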

Pull request #820 cannot fix this problem either, because the loop break condition is never reached.

Do you have any suggestions for this? @percyliang @teetone

It seems the only robust solution is to get the tokenized choices directly and compute len(tokenized_choice). Another possible solution is to filter out all non-ASCII input characters.
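
A rough sketch of the first idea, as a hypothetical helper rather than HELM's actual compute_logprob_and_length: tokenize the reference with the same tokenizer the model used, then score the last len(reference_tokens) tokens of the echoed completion, avoiding character-level matching entirely:

# Hypothetical sketch: `tokenize` is assumed to be the same tokenizer the
# model applied to the request; `completion_tokens` are the echoed tokens.
def logprob_of_reference(completion_tokens, reference_text, tokenize):
    reference_tokens = tokenize(reference_text)
    # The reference is the suffix of the echoed prompt, so take its last
    # len(reference_tokens) tokens instead of matching characters.
    span = completion_tokens[-len(reference_tokens):]
    return sum(token.logprob for token in span), len(span)

One caveat: a reference tokenized in isolation can tokenize differently at its left boundary than it does inside the full prompt, so the boundary would still need care.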

yuhui-zh15 commented 2 years ago

The together/gpt-j-6b case seems much more complex, and the bug seems to come from somewhere else.

Error:

AssertionError: 
Expected: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune's mass.  
----> 245 chars
Actual: ĠareĠtwoĠbrandsĠofĠhydrometersĠavailableĠonĠtheĠmarket,ĠandĠtheĠonesĠapprovedĠbyĠinternationalĠstandardsĠareĠ:Ġ[substeps]ĠDiichydraseĠ(blueĠorĠgreyĠwithoutĠtheĠlabelĠ"ĠsodiumĠhydrometerĠ").ĠTheĠdiichydraseĠmeterĠmeasuresĠtheĠneptune'sĠmass.
----> 240 chars

Why is the actual length less than the expected length? The loop only exits once the actual length >= the expected length. (Note that the 5-character gap matches the length of the missing leading "There".)

It seems the request is wrong. Here is the raw output from print(request_state):

RequestState(instance=Instance(input='Education and Communications: [header] How to calibrate a hydrometer [title] Identify the parts of the hydrometer. [step] A hydrometer is a glass device that has a bulbous, weighted end designed to float in a liquid and a narrow, long stem with a graduated scale on the other end. It is used to measure the specific gravity of a liquid. ', references=[Reference(output='[substeps] You can get a hydrometer that is labeled for gas both from the fermentation process, and from the reading of light and air. The rated gas for fuel is 44.99 °.', tags=[]), Reference(output='Specific gravity is the density of a liquid compared to water. [substeps] The bulbous end is placed into the liquid in question while the narrow stem will stick out of the liquid.', tags=['correct']), Reference(output='[substeps] Hydrometers are often made of silver or stainless steel. When finished, the metal is much more solid, and will generally be made of real metal.', tags=[]), Reference(output='There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', tags=[])], split='valid', sub_split=None, id='id39905', perturbation=None, contrast_inputs=None, contrast_references=None), reference_index=3, request_mode='calibration', train_trial_index=0, output_mapping=None, request=Request(model='together/gpt-j-6b', prompt='Answer: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', temperature=0, num_completions=1, top_k_per_token=1, max_tokens=0, stop_sequences=[], echo_prompt=True, top_p=1, presence_penalty=0, frequency_penalty=0, random=None), result=RequestResult(success=True, completions=[Sequence(text=' are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). 
The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[Token(text='Ġare', logprob=0, top_logprobs={}), Token(text='Ġtwo', logprob=-5.34375, top_logprobs={'Ġthe': -3.0546875}), Token(text='Ġbrands', logprob=-8.9453125, top_logprobs={'Ġways': -3.068359375}), Token(text='Ġof', logprob=-0.88818359375, top_logprobs={'Ġof': -0.88818359375}), Token(text='Ġhyd', logprob=-8.3515625, top_logprobs={'Ġthe': -3.28125}), Token(text='rom', logprob=-2.279296875, top_logprobs={'roc': -1.3095703125}), Token(text='eters', logprob=-3.111328125, top_logprobs={'or': -0.2205810546875}), Token(text='Ġavailable', logprob=-3.212890625, top_logprobs={',': -1.759765625}), Token(text='Ġon', logprob=-2.00390625, top_logprobs={'.': -1.9560546875}), Token(text='Ġthe', logprob=-0.128173828125, top_logprobs={'Ġthe': -0.128173828125}), Token(text='Ġmarket', logprob=-0.0716552734375, top_logprobs={'Ġmarket': -0.0716552734375}), Token(text=',', logprob=-1.689453125, top_logprobs={'.': -1.158203125}), Token(text='Ġand', logprob=-2.583984375, top_logprobs={'Ġthe': -1.5615234375}), Token(text='Ġthe', logprob=-2.484375, top_logprobs={'Ġthey': -1.6640625}), Token(text='Ġones', logprob=-5.0078125, top_logprobs={'Ġone': -2.638671875}), Token(text='Ġapproved', logprob=-8.640625, top_logprobs={'ĠI': -1.529296875}), Token(text='Ġby', logprob=-0.2919921875, top_logprobs={'Ġby': -0.2919921875}), Token(text='Ġinternational', logprob=-6.9140625, top_logprobs={'Ġthe': -0.467529296875}), Token(text='Ġstandards', logprob=-1.296875, top_logprobs={'Ġstandards': -1.296875}), Token(text='Ġare', logprob=-0.82080078125, top_logprobs={'Ġare': -0.82080078125}), Token(text='Ġ:', logprob=-7.609375, top_logprobs={'Ġthe': -2.056640625}), Token(text='Ġ[', logprob=-8.3984375, top_logprobs={'Ċ': -0.80908203125}), Token(text='sub', logprob=-9.125, top_logprobs={'Table': -2.12109375}), Token(text='steps', logprob=-10.2421875, top_logprobs={'scription': -1.279296875}), Token(text=']', logprob=-1.048828125, top_logprobs={']': -1.048828125}), Token(text='ĠDi', logprob=-9.375, top_logprobs={'Ċ': -1.5693359375}), Token(text='ich', logprob=-12.6953125, top_logprobs={'ast': -0.93896484375}), Token(text='yd', logprob=-6.3828125, top_logprobs={'rom': -0.5234375}), Token(text='rase', logprob=-11.578125, top_logprobs={'rom': -0.47119140625}), Token(text='Ġ(', logprob=-3.05859375, top_logprobs={',': -2.00390625}), Token(text='blue', logprob=-8.875, top_logprobs={'D': -2.4921875}), Token(text='Ġor', logprob=-4.63671875, top_logprobs={')': -1.1455078125}), Token(text='Ġgrey', logprob=-3.5703125, top_logprobs={'Ġgreen': -1.8046875}), Token(text='Ġwithout', logprob=-9.0546875, top_logprobs={')': -1.056640625}), Token(text='Ġthe', logprob=-2.48046875, top_logprobs={'Ġa': -2.08203125}), Token(text='Ġlabel', logprob=-4.19921875, top_logprobs={'Ġletter': -3.123046875}), Token(text='Ġ"', logprob=-3.943359375, top_logprobs={')': -1.021484375}), Token(text='Ġsodium', logprob=-12.09375, top_logprobs={'Di': -3.234375}), Token(text='Ġhyd', logprob=-2.13671875, top_logprobs={'Ġchloride': -1.9267578125}), Token(text='rom', logprob=-3.501953125, top_logprobs={'rox': -0.0810546875}), Token(text='eter', logprob=-0.2010498046875, top_logprobs={'eter': -0.2010498046875}), Token(text='Ġ"', logprob=-1.9638671875, top_logprobs={'"': -1.4326171875}), Token(text=').', logprob=-3.5078125, top_logprobs={')': -1.484375}), Token(text='ĠThe', logprob=-2.291015625, top_logprobs={'Ġ[': -2.158203125}), Token(text='Ġdi', logprob=-5.1015625, top_logprobs={'Ġother': 
-2.2890625}), Token(text='ich', logprob=-1.560546875, top_logprobs={'hyd': -0.654296875}), Token(text='yd', logprob=-0.00273895263671875, top_logprobs={'yd': -0.00273895263671875}), Token(text='rase', logprob=-0.1517333984375, top_logprobs={'rase': -0.1517333984375}), Token(text='Ġmeter', logprob=-7.93359375, top_logprobs={'Ġis': -1.326171875}), Token(text='Ġmeasures', logprob=-4.2578125, top_logprobs={'Ġis': -0.80615234375}), Token(text='Ġthe', logprob=-0.48291015625, top_logprobs={'Ġthe': -0.48291015625}), Token(text='Ġne', logprob=-14.765625, top_logprobs={'Ġdensity': -1.13671875}), Token(text='pt', logprob=-2.80859375, top_logprobs={'ph': -1.26953125}), Token(text='une', logprob=-4.0703125, top_logprobs={'un': -0.236328125}), Token(text="'s", logprob=-2.791015625, top_logprobs={'Ġof': -2.775390625}), Token(text='Ġmass', logprob=-6.21875, top_logprobs={'Ġspecific': -0.98486328125}), Token(text='.', logprob=-2.931640625, top_logprobs={'Ġin': -1.369140625})], finish_reason={'reason': 'length'})], cached=True, request_time=0, request_datetime=None, error=None, batch_size=245, batch_request_time=4.580406188964844), num_in_context_examples=0, input_truncated=False, num_conditioning_tokens=0)

Note the part highlighted with *:

RequestState(..., 
request=Request(
  model='together/gpt-j-6b', 
  prompt=*'Answer: There are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', 
  temperature=0, 
  num_completions=1, 
  top_k_per_token=1, 
  max_tokens=0, 
  stop_sequences=[], 
  echo_prompt=True, 
  top_p=1, 
  presence_penalty=0, 
  frequency_penalty=0, 
  random=None
), 
result=RequestResult(
  success=True, 
  completions=[Sequence(text=*' are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[...]]
...
)

I'm not sure why result.completions from GPT-J is missing several tokens (Answer: There) while those tokens clearly exist in request.prompt.
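
As an aside, the Ġ-prefixed dumps above are easier to diff after mapping the byte-level token strings back to text (the transformers GPT-2 tokenizer's convert_tokens_to_string does this properly). A minimal sketch covering just the common cases, assuming GPT-2's convention of Ġ for space and Ċ for newline:

def detokenize(tokens):
    # Handles ASCII-only tokens; non-ASCII text like 'Â¼' would need the
    # full inverse of GPT-2's bytes_to_unicode mapping.
    return "".join(tokens).replace("Ġ", " ").replace("Ċ", "\n")

print(detokenize(["Ġare", "Ġtwo", "Ġbrands", "Ġof", "Ġhyd", "rom", "eters"]))
# " are two brands of hydrometers" -- the leading "There" is indeed gone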

teetone commented 2 years ago

> Okay, now I can reproduce this bug with the following command and input file. [...] And I now understand what happened by printing the original input:
>
> Expected: [step] Using a vendor-specific bean peeler, cut the broccoli into ¼ inch-wide strips. This works best when you use cucumbers.
> Actual: vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.
>
> [...]

It's also missing "Using a" at the beginning, right? Do you know why that is? I think the main problem is the couple of missing tokens at the beginning.

teetone commented 2 years ago

> The together/gpt-j-6b case seems much more complex, and the bug seems to come from somewhere else. [...]
>
> I'm not sure why result.completions from GPT-J is missing several tokens (Answer: There) while those tokens clearly exist in request.prompt.

@LorrinWWW Do you have any insight into this? This happens with echo_prompt=True and max_tokens=0 for together/gpt-j-6b.

LorrinWWW commented 2 years ago

> The together/gpt-j-6b case seems much more complex, and the bug seems to come from somewhere else. [...]
>
> @LorrinWWW Do you have any insight into this? This happens with echo_prompt=True and max_tokens=0 for together/gpt-j-6b.

I looked into this for BLOOM and OPT. It could be a truncation/padding issue in an earlier version (on our side), and the current version does not have this issue (I guess it is the same case for GPT-J). @teetone Could you send me all requests where result.completions does not match request.prompt? I will rerun and check them before sending them to you. Thanks!
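
A hedged sketch of how that list could be collected, iterating over request states shaped like the dump above (the field names follow that dump; this is not an existing HELM utility):

# Hypothetical sketch: collect echo-mode requests whose returned completion
# does not reproduce the prompt it was supposed to echo back.
def find_echo_mismatches(request_states):
    bad = []
    for state in request_states:
        request, result = state.request, state.result
        if not (request.echo_prompt and result is not None and result.success):
            continue
        for completion in result.completions:
            # In echo mode the completion should reproduce the full prompt;
            # anything else (like the dump above, which drops "Answer: There")
            # means tokens went missing.
            if completion.text != request.prompt:
                bad.append(state)
    return bad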

teetone commented 2 years ago

> [...] @teetone Could you send me all requests where result.completions does not match request.prompt? I will rerun and check them before sending them to you. Thanks!

To be safe, could we regenerate the results for queries with echo_prompt=True for those models? There shouldn't be too many.

LorrinWWW commented 2 years ago

together/gpt-j-6b case seems much more complex, and seems the bug is from other parts.
Error:

AssertionError: 
Expected: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune's mass.  
----> 245 chars
Actual: ĠareĠtwoĠbrandsĠofĠhydrometersĠavailableĠonĠtheĠmarket,ĠandĠtheĠonesĠapprovedĠbyĠinternationalĠstandardsĠareĠ:Ġ[substeps]ĠDiichydraseĠ(blueĠorĠgreyĠwithoutĠtheĠlabelĠ"ĠsodiumĠhydrometerĠ").ĠTheĠdiichydraseĠmeterĠmeasuresĠtheĠneptune'sĠmass.
----> 240 chars

Why the actual length \< expected length? The loop will only exit if actual length >= expected length..
It seems the request is wrong. Here is the raw output from print(request_state):

RequestState(instance=Instance(input='Education and Communications: [header] How to calibrate a hydrometer [title] Identify the parts of the hydrometer. [step] A hydrometer is a glass device that has a bulbous, weighted end designed to float in a liquid and a narrow, long stem with a graduated scale on the other end. It is used to measure the specific gravity of a liquid. ', references=[Reference(output='[substeps] You can get a hydrometer that is labeled for gas both from the fermentation process, and from the reading of light and air. The rated gas for fuel is 44.99 °.', tags=[]), Reference(output='Specific gravity is the density of a liquid compared to water. [substeps] The bulbous end is placed into the liquid in question while the narrow stem will stick out of the liquid.', tags=['correct']), Reference(output='[substeps] Hydrometers are often made of silver or stainless steel. When finished, the metal is much more solid, and will generally be made of real metal.', tags=[]), Reference(output='There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', tags=[])], split='valid', sub_split=None, id='id39905', perturbation=None, contrast_inputs=None, contrast_references=None), reference_index=3, request_mode='calibration', train_trial_index=0, output_mapping=None, request=Request(model='together/gpt-j-6b', prompt='Answer: There are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', temperature=0, num_completions=1, top_k_per_token=1, max_tokens=0, stop_sequences=[], echo_prompt=True, top_p=1, presence_penalty=0, frequency_penalty=0, random=None), result=RequestResult(success=True, completions=[Sequence(text=' are two brands of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). 
The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[Token(text='Ġare', logprob=0, top_logprobs={}), Token(text='Ġtwo', logprob=-5.34375, top_logprobs={'Ġthe': -3.0546875}), Token(text='Ġbrands', logprob=-8.9453125, top_logprobs={'Ġways': -3.068359375}), Token(text='Ġof', logprob=-0.88818359375, top_logprobs={'Ġof': -0.88818359375}), Token(text='Ġhyd', logprob=-8.3515625, top_logprobs={'Ġthe': -3.28125}), Token(text='rom', logprob=-2.279296875, top_logprobs={'roc': -1.3095703125}), Token(text='eters', logprob=-3.111328125, top_logprobs={'or': -0.2205810546875}), Token(text='Ġavailable', logprob=-3.212890625, top_logprobs={',': -1.759765625}), Token(text='Ġon', logprob=-2.00390625, top_logprobs={'.': -1.9560546875}), Token(text='Ġthe', logprob=-0.128173828125, top_logprobs={'Ġthe': -0.128173828125}), Token(text='Ġmarket', logprob=-0.0716552734375, top_logprobs={'Ġmarket': -0.0716552734375}), Token(text=',', logprob=-1.689453125, top_logprobs={'.': -1.158203125}), Token(text='Ġand', logprob=-2.583984375, top_logprobs={'Ġthe': -1.5615234375}), Token(text='Ġthe', logprob=-2.484375, top_logprobs={'Ġthey': -1.6640625}), Token(text='Ġones', logprob=-5.0078125, top_logprobs={'Ġone': -2.638671875}), Token(text='Ġapproved', logprob=-8.640625, top_logprobs={'ĠI': -1.529296875}), Token(text='Ġby', logprob=-0.2919921875, top_logprobs={'Ġby': -0.2919921875}), Token(text='Ġinternational', logprob=-6.9140625, top_logprobs={'Ġthe': -0.467529296875}), Token(text='Ġstandards', logprob=-1.296875, top_logprobs={'Ġstandards': -1.296875}), Token(text='Ġare', logprob=-0.82080078125, top_logprobs={'Ġare': -0.82080078125}), Token(text='Ġ:', logprob=-7.609375, top_logprobs={'Ġthe': -2.056640625}), Token(text='Ġ[', logprob=-8.3984375, top_logprobs={'Ċ': -0.80908203125}), Token(text='sub', logprob=-9.125, top_logprobs={'Table': -2.12109375}), Token(text='steps', logprob=-10.2421875, top_logprobs={'scription': -1.279296875}), Token(text=']', logprob=-1.048828125, top_logprobs={']': -1.048828125}), Token(text='ĠDi', logprob=-9.375, top_logprobs={'Ċ': -1.5693359375}), Token(text='ich', logprob=-12.6953125, top_logprobs={'ast': -0.93896484375}), Token(text='yd', logprob=-6.3828125, top_logprobs={'rom': -0.5234375}), Token(text='rase', logprob=-11.578125, top_logprobs={'rom': -0.47119140625}), Token(text='Ġ(', logprob=-3.05859375, top_logprobs={',': -2.00390625}), Token(text='blue', logprob=-8.875, top_logprobs={'D': -2.4921875}), Token(text='Ġor', logprob=-4.63671875, top_logprobs={')': -1.1455078125}), Token(text='Ġgrey', logprob=-3.5703125, top_logprobs={'Ġgreen': -1.8046875}), Token(text='Ġwithout', logprob=-9.0546875, top_logprobs={')': -1.056640625}), Token(text='Ġthe', logprob=-2.48046875, top_logprobs={'Ġa': -2.08203125}), Token(text='Ġlabel', logprob=-4.19921875, top_logprobs={'Ġletter': -3.123046875}), Token(text='Ġ"', logprob=-3.943359375, top_logprobs={')': -1.021484375}), Token(text='Ġsodium', logprob=-12.09375, top_logprobs={'Di': -3.234375}), Token(text='Ġhyd', logprob=-2.13671875, top_logprobs={'Ġchloride': -1.9267578125}), Token(text='rom', logprob=-3.501953125, top_logprobs={'rox': -0.0810546875}), Token(text='eter', logprob=-0.2010498046875, top_logprobs={'eter': -0.2010498046875}), Token(text='Ġ"', logprob=-1.9638671875, top_logprobs={'"': -1.4326171875}), Token(text=').', logprob=-3.5078125, top_logprobs={')': -1.484375}), Token(text='ĠThe', logprob=-2.291015625, top_logprobs={'Ġ[': -2.158203125}), Token(text='Ġdi', logprob=-5.1015625, top_logprobs={'Ġother': 
-2.2890625}), Token(text='ich', logprob=-1.560546875, top_logprobs={'hyd': -0.654296875}), Token(text='yd', logprob=-0.00273895263671875, top_logprobs={'yd': -0.00273895263671875}), Token(text='rase', logprob=-0.1517333984375, top_logprobs={'rase': -0.1517333984375}), Token(text='Ġmeter', logprob=-7.93359375, top_logprobs={'Ġis': -1.326171875}), Token(text='Ġmeasures', logprob=-4.2578125, top_logprobs={'Ġis': -0.80615234375}), Token(text='Ġthe', logprob=-0.48291015625, top_logprobs={'Ġthe': -0.48291015625}), Token(text='Ġne', logprob=-14.765625, top_logprobs={'Ġdensity': -1.13671875}), Token(text='pt', logprob=-2.80859375, top_logprobs={'ph': -1.26953125}), Token(text='une', logprob=-4.0703125, top_logprobs={'un': -0.236328125}), Token(text="'s", logprob=-2.791015625, top_logprobs={'Ġof': -2.775390625}), Token(text='Ġmass', logprob=-6.21875, top_logprobs={'Ġspecific': -0.98486328125}), Token(text='.', logprob=-2.931640625, top_logprobs={'Ġin': -1.369140625})], finish_reason={'reason': 'length'})], cached=True, request_time=0, request_datetime=None, error=None, batch_size=245, batch_request_time=4.580406188964844), num_in_context_examples=0, input_truncated=False, num_conditioning_tokens=0)

Note the part highlighted with *:

RequestState(..., 
request=Request(
  model='together/gpt-j-6b', 
  prompt=*'Answer: There are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', 
  temperature=0, 
  num_completions=1, 
  top_k_per_token=1, 
  max_tokens=0, 
  stop_sequences=[], 
  echo_prompt=True, 
  top_p=1, 
  presence_penalty=0, 
  frequency_penalty=0, 
  random=None
), 
result=RequestResult(
  success=True, 
  completions=[Sequence(text=*' are two brands* of hydrometers available on the market, and the ones approved by international standards are : [substeps] Diichydrase (blue or grey without the label " sodium hydrometer "). The diichydrase meter measures the neptune\'s mass.', logprob=-259.1124801635742, tokens=[...])],
...
)

I'm not sure why result.completions from GPT-J is missing several leading tokens (Answer: There) when those tokens clearly exist in request.prompt.

@LorrinWWW Do you have any insight into this? This happens when echo=True and max_tokens=0 for together/gpt-j-6b.

I looked into this for Bloom and OPT. It could have been a truncation/padding issue in an earlier version (on our side), and the current version does not have this issue. (I guess it should be the same for GPT-J.) @teetone Could you send me all requests where result.completions does not match request.prompt? I will rerun and check them before sending them to you. Thanks!

To be safe, could we regenerate results for queries with echo_prompt=True for those models? It shouldn't be too many.

Sure

yuhui-zh15 commented 2 years ago

Okay, I can now reproduce this bug with the following command and input file:

venv/bin/benchmark-run -r commonsense:model=openai/text-babbage-001,dataset=hellaswag,method=multiple_choice_separate_calibrated --suite 0903
hellaswag_val.jsonl
{"ind": 11289, "ctx": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "activity_label": "Food and Entertaining", "ctx_a": "[header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.", "ctx_b": "", "split": "val", "split_type": "indomain", "label": 2, "endings": ["[step] Using a vendor-specific bean peeler, cut the broccoli into \u00bc inch-wide strips. This works best when you use cucumbers.", "[title] Wash each piece inside out then place them in a steamer bag. [step] Save the white part if you plan to steam your broccoli afterward.", "[step] Cutting the broccoli into small pieces will help it to cook faster. [substeps] If you want to eat the stalks, they should be cut into pieces that are slightly smaller than the florets.", "[step] Rinse, drain the water, and cut once done. [title] Place your cooked broccoli into a steamer basket/pot."], "source_id": "wikihow~18617"}

And I now understand what happened by printing the original input:

AssertionError: 
Expected: [step] Using a vendor-specific bean peeler, cut the broccoli into ¼ inch-wide strips. This works best when you use cucumbers.
Actual: vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.

So ¼ (1 char) is expanded to multiple characters by the tokenizer, and counting the span length in characters leads to this error. Pull request #820 cannot fix this either, because the loop's break condition is never reached. Do you have any suggestions? @percyliang @teetone It seems the only complete solution is to directly get the tokenized choices and compute len(tokenized_choice). Another option would be to filter out all non-ASCII characters from the input.

It's also missing Using a at the beginning, right? Do you know why that is? I think the main problem is the couple of missing tokens at the beginning.

Yes, this is because we select the span by counting the character length of the last tokens.

For example, if len(reference) = 125, we keep prepending tokens from the end of completions until len(span) >= 125:

# Span selection: walk the completion's tokens from the end, prepending
# until the accumulated span is at least as long as the reference.
span_tokens = []
for token in completions[::-1]:
    if len(''.join(span_tokens)) >= len(reference):
        break
    span_tokens.insert(0, token)  # lists have no prepend(); insert at the front

Because the tokenizer expands ¼ (1 char) into the two byte tokens bytes: \xc2 and bytes:\xbc (21 chars in total), the loop ends earlier than it should, and the first several tokens go missing.
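
To make this concrete, here is a small self-contained version of the same loop (the token strings are simplified stand-ins that mimic the byte-fallback rendering, not actual API output):

# Illustrative stand-ins: "bytes: \xc2" and "bytes:\xbc" are literal strings
# here, mimicking how the API renders the two UTF-8 bytes of ¼.
reference = "into ¼ inch"  # 11 characters; ¼ counts as a single character
tokens = ["into", "bytes: \\xc2", "bytes:\\xbc", " inch"]  # 4 + 11 + 10 + 5 chars

span_tokens = []
for token in tokens[::-1]:
    if len("".join(span_tokens)) >= len(reference):
        break
    span_tokens.insert(0, token)

print(span_tokens)  # ['bytes:\\xbc', ' inch']: the check passes before 'into' is reached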

Here is the output of print(token, len(token)) inside the for token in completions[::-1] loop:

Completions:
. 1
umbers 6
 cuc 4
 use 4
 you 4
 when 5
 best 5
 works 6
 This 5
. 1
 strips 7
wide 4
- 1
 inch 5
bytes:\xbc 10
bytes: \xc2 11
 into 5
 broccoli 9
 the 4
 cut 4
, 1
er 2
 peel 5
 bean 5
specific 8
- 1
 vendor 7
---> length = 130

Reference:
Food and Entertaining: [header] How to steam broccoli without a steamer [title] Wash your broccoli. [step] Check for insects. [title] Cut the broccoli as desired.  [step] Using a vendor-specific bean peeler, cut the broccoli intobytes: \xc2bytes:\xbc inch-wide strips. This works best when you use cucumbers.
---> length = 125

Therefore, this is a bug induced by the tokenizer: tokenization should not change the character length of the input. It could be fixed with a detokenizer, but we don't have that component in the current codebase.
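
As a sketch of the tokenized-choice idea above: select the span by token count rather than character count, so byte-fallback expansion affects both sides identically. Here, tokenize is a hypothetical stand-in for the model's own tokenizer, not a function in this codebase:

# Hypothetical sketch; tokenize() is assumed, not real code.
num_reference_tokens = len(tokenize(reference))
span_tokens = completion_tokens[-num_reference_tokens:]
span = ''.join(span_tokens)

One caveat: a reference can tokenize differently on its own than as a suffix of the full prompt, so the count may be off by a token at the boundary.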

teetone commented 2 years ago


We have the decode method on WindowService, and WindowServiceFactory for obtaining one. There are examples in the code of how they are used.
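
For instance, a rough sketch of replacing character counting with decode (names taken from the comment above; the exact signatures in the codebase may differ):

# Rough sketch; names and signatures assumed from the comment above, not verified.
window_service = WindowServiceFactory.get_window_service(model_name, tokenizer_service)
# Decode the candidate span's tokens back to text, then compare the decoded
# text against the reference instead of counting raw characters.
decoded_span = window_service.decode(span_tokens)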

yuhui-zh15 commented 2 years ago

Thanks for pointing this out! #826 should be a perfect fix.