Presence of character Ġ before each token in output

osmanii commented 1 year ago

I was working on the "05- Neuron Factors.ipynb" notebook and noticed the character Ġ before each token in the output. The output comes from the call "nmf_1.explore()". I am not quite sure why it is doing that. Please check the screenshot below.

Your help is appreciated.

cristianestojeda commented 1 year ago

Happened to me using GPT-2. I solved this issue by adding the following lines:

    if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
        token = token[1:]

right after the first loop line of the nmf.explore() method:

    for idx, token in enumerate(self.tokens[input_sequence]):  # self.tokens[:-1]
        if self.config['token_prefix'] is not None and token[0] == self.config['token_prefix']:
            token = token[1:]
        type = "input" if idx < self.n_input_tokens else 'output'
        tokens.append({'token': token,
                       'token_id': int(self.token_ids[input_sequence][idx]),
                       # 'token_id': int(self.token_ids[idx]),
                       'type': type,
                       # 'value': str(components[0][comp_num][idx]),  # because json complains of floats
                       'position': idx
                       })
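The prefix check in that patch can also be tried outside the library. The following is a minimal standalone sketch, not part of ecco: `strip_token_prefix` is a hypothetical helper name, and the `"Ġ"` prefix is assumed to match what `self.config['token_prefix']` holds for GPT-2. It uses `str.startswith` rather than `token[0]`, so an empty token does not raise an `IndexError`.

```python
from typing import List, Optional


def strip_token_prefix(token: str, prefix: Optional[str]) -> str:
    """Return `token` without a leading `prefix` marker, if it has one.

    Hypothetical helper for illustration only; it is not an ecco function.
    """
    # `startswith` is safe on empty strings, unlike indexing with token[0].
    if prefix and token.startswith(prefix):
        return token[len(prefix):]
    return token


# Example tokens as a GPT-2 style tokenizer might produce them (assumed data).
tokens: List[str] = ["ĠThe", "Ġcountries", "Ġof", "", "Europe"]
cleaned = [strip_token_prefix(t, "Ġ") for t in tokens]
print(cleaned)  # ['The', 'countries', 'of', '', 'Europe']
```

Passing `None` as the prefix leaves every token unchanged, which mirrors the `is not None` guard in the patch.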

jalammar commented 1 year ago

Yeah, that shouldn't happen. A bunch of tokenizers put a marker character like Ġ at the beginning of a token to indicate how the token joins whatever comes before it in the sequence (in GPT-2's vocabulary, Ġ encodes a leading space). Which is why rendering the output needs to run in tandem with the tokenizer and its settings.
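To see where Ġ comes from in the GPT-2 case, here is a small sketch that rebuilds the byte-to-character table used by GPT-2's byte-level BPE, to the best of my understanding of the reference encoder, and then reverses it to turn tokens back into readable text. The names `BYTE_TO_CHAR`, `CHAR_TO_BYTE`, and `tokens_to_text` are local to this sketch and are not ecco or tokenizer APIs.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Map every byte value (0-255) to a visible unicode character."""
    # Bytes that already have a visible, unambiguous character keep it.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Remaining bytes (space, control characters, ...) are shifted past 255.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def tokens_to_text(tokens: List[str]) -> str:
    """Undo the byte-to-character mapping for a list of token strings."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


print(BYTE_TO_CHAR[ord(" ")])               # Ġ
print(tokens_to_text(["Hello", "Ġworld"]))  # Hello world
```

The space byte (32) is not in the printable set, so it is shifted to code point 288, which is Ġ. A renderer that only strips the marker loses the word boundary, while one that maps it back to a space keeps it; which is preferable depends on how the tokens are displayed.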

jalammar / ecco

Presence of character Ġ before each token in output #87