jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).
https://ecco.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.96k stars 167 forks

Presence of character Ġ before each token in output #87

Open osmanii opened 1 year ago

osmanii commented 1 year ago

I was working on the "05- Neuron Factors.ipynb" notebook and noticed the character Ġ before each token in the output of "nmf_1.explore()". I am not sure why this happens. Please check the screenshot below.

[screenshot of the nmf_1.explore() output]

Your help is appreciated.

cristianestojeda commented 1 year ago

This happened to me using GPT-2. I solved it by adding the following lines:

    if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
        token = token[1:]

right after the first line of the loop in the nmf.explore() method:

    for idx, token in enumerate(self.tokens[input_sequence]):  # self.tokens[:-1]
        if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
            token = token[1:]
        type = "input" if idx < self.n_input_tokens else 'output'
        tokens.append({'token': token,
                       'token_id': int(self.token_ids[input_sequence][idx]),
                       # 'token_id': int(self.token_ids[idx]),
                       'type': type,
                       # 'value': str(components[0][comp_num][idx]),  # because json complains of floats
                       'position': idx
                       })
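The same idea can be tried outside of Ecco. Below is a minimal standalone sketch of the prefix-stripping step; `strip_token_prefix` and `clean_tokens` are hypothetical helper names, not part of Ecco's API, and the default prefix "Ġ" is an assumption that matches GPT-2. It uses `str.startswith` instead of `token[0]`, so an empty token does not raise an IndexError.

```python
from typing import List, Optional


def strip_token_prefix(token: str, token_prefix: Optional[str]) -> str:
    """Drop a leading marker such as 'Ġ' from a token, if it is present."""
    # startswith() is safe on empty strings, unlike token[0].
    if token_prefix and token.startswith(token_prefix):
        return token[len(token_prefix):]
    return token


def clean_tokens(tokens: List[str], token_prefix: Optional[str] = "Ġ") -> List[str]:
    """Apply strip_token_prefix to every token in a sequence."""
    return [strip_token_prefix(t, token_prefix) for t in tokens]


# Hypothetical GPT-2 style tokens, including an empty one.
print(clean_tokens(["The", "Ġquick", "Ġbrown", "Ġfox", ""]))
# ['The', 'quick', 'brown', 'fox', '']
```

Note that stripping the prefix also drops the information about where spaces were, which is fine for a per-token display but not for reconstructing the original text.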

jalammar commented 1 year ago

Yeah, that shouldn't happen. A bunch of tokenizers put a marker character like Ġ at the beginning of a token to encode how the token relates to whatever comes before it in the sequence (in GPT-2's byte-level BPE, Ġ stands for a leading space). That's why rendering the output needs to run in tandem with the tokenizer and its settings.
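For GPT-2 specifically, Ġ appears because its byte-level BPE maps every byte to a printable character, and the space byte ends up as Ġ (U+0120). The sketch below re-implements that byte-to-character table as I understand it from GPT-2's encoder and inverts it to recover display text. It is not Ecco's rendering code, and `to_display_text` is a hypothetical helper name.

```python
def bytes_to_unicode():
    """Build a byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (space, control characters, ...) are shifted to 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_display_text(token: str) -> str:
    """Map a vocabulary token back to the text it stands for."""
    raw = bytearray(CHAR_TO_BYTE[c] for c in token)
    return raw.decode("utf-8", errors="replace")


print(BYTE_TO_CHAR[ord(" ")])            # Ġ
print(repr(to_display_text("Ġquick")))   # ' quick'
```

Decoding through the table keeps the leading space, whereas simply dropping the first character discards it. Which of the two is preferable depends on how the visualization lays out tokens.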