jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).
https://ecco.readthedocs.io
BSD 3-Clause "New" or "Revised" License

lm.generate and HuggingFace's generate give different results with do_sample=False #92

noeliaferruz closed this issue 1 year ago

noeliaferruz commented 1 year ago

Hi, thanks for the great work.

I'm generating text with ecco's lm.generate function without sampling:

output = lm.generate(text, generate=200, max_length=1024,
        eos_token_id=1, pad_token_id=0,
        attribution=['grad_x_input', 'ig'])

Then I generate with the original HuggingFace library, using the same arguments as ecco (literally copied from the call in lm.py):

outputs=model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=1,
            # FIXME: +1 in max_length to account for first start token in decoder, find a better way to do this
            max_length=1024,
            do_sample=False,
            top_p=None,
            top_k=None,
            temperature=1,
            eos_token_id=1, pad_token_id=0,
            return_dict_in_generate=True,
            output_scores=True)
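
For reference, input_ids and attention_mask come from the model's tokenizer, and the generated text is decoded from the sequences field of the returned dict. Roughly (a sketch using the standard HuggingFace API; my exact script may differ slightly):

encoded = tokenizer(text, return_tensors='pt')
input_ids = encoded['input_ids']
attention_mask = encoded['attention_mask']
# ... call model.generate as above ...
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)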

In both cases we use the same seed sequence (MKDIDTLISNNAL), but the first method produces WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ, while the second produces WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL.
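
The two outputs share the prefix WSKMLVEEDPGFFE and diverge after that. A quick character-level check pinpoints the first mismatch (a minimal sketch over the two strings reported above):

ecco_seq = "WSKMLVEEDPGFFERLSQAQKPRALFITCSDSRLVPEQ"
hf_seq = "WSKMLVEEDPGFFEKLAQAQKPRFLWIGCSDSRVPAERL"
for i, (a, b) in enumerate(zip(ecco_seq, hf_seq)):
    if a != b:
        # prints: first mismatch at position 14: 'R' vs 'K'
        print(f"first mismatch at position {i}: {a!r} vs {b!r}")
        break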

The two functions should give the same sequence of tokens since we are not sampling. There must be a bug in how lm.generate produces the tokens iteratively (we know the second sequence is the right one).
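
For reference, greedy decoding (do_sample=False, num_beams=1) involves no randomness at all: at every step the next token is simply the argmax of the logits, so any two correct implementations must produce identical token sequences given the same model and prompt. A minimal reference loop (a sketch, assuming a standard HuggingFace causal LM and batch size 1) looks like this:

import torch

def greedy_generate(model, input_ids, max_new_tokens, eos_token_id=1):
    # With do_sample=False the next token is always the argmax over the
    # logits at the last position; no RNG is involved anywhere.
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == eos_token_id:  # assumes batch size 1
            break
    return input_ids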