jalammar / ecco

Explain, analyze, and visualize NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BERT, RoBERTa, T5, and T0).
https://ecco.readthedocs.io
BSD 3-Clause "New" or "Revised" License

KeyError: 'tokenizer_config' #66

Closed · guustfranssensEY closed 2 years ago

guustfranssensEY commented 2 years ago

I am working on integrating my custom model vinai/bertweet-base with Ecco; however, I ran into the following issue:

Traceback (most recent call last):
  File "experiment_ecco.py", line 44, in <module>
    nmf_1.explore()
  File "C:\Users\XXXX\anaconda3\envs\disaster_tweets\lib\site-packages\ecco\output.py", line 827, in explore
    }})"""
KeyError: 'tokenizer_config'

I created the lm in the following way:

# loading in the tokenizer and model
import torch
from transformers import AutoTokenizer
from ecco.lm import LM

tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True, use_fast=False)

# this model was obtained by fine-tuning
# AutoModelForSequenceClassification.from_pretrained("vinai/bertweet-base",
#     output_hidden_states=True, output_attentions=True, num_labels=2)
model = torch.load("bertmodel.pth")

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],  # raw string keeps the regex escape intact
    'token_prefix': '',
    'partial_token_prefix': ''
}

lm = LM(model=model, tokenizer=tokenizer, model_name="vinai/bertweet-base",
        config=model_config, collect_activations_flag=True, verbose=True)

tweet = "So running down the stairs was a bad idea full on collided... With the floor ??"
inputs = lm.tokenizer([tweet], return_tensors="pt")
output = lm(inputs)

nmf_1 = output.run_nmf(n_components=8)
nmf_1.explore()

Upon further inspection, I believe the error comes from the following lines in ecco/output.py:

js = f"""
         requirejs(['basic', 'ecco'], function(basic, ecco){{
            const viz_id = basic.init()

            ecco.interactiveTokensAndFactorSparklines(viz_id, {data},
            {{
            'hltrCFG': {{'tokenization_config': {json.dumps(self.config['tokenizer_config'])}
                }}
            }})
         }}, function (err) {{
            console.log(err);
        }})"""

I could not trace back the origin of tokenizer_config. Therefore I assume it also has to be passed in the model_config for a custom model? If so, this needs to be specified in the docs.
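For illustration, here is a minimal standalone sketch of the failure mode (a hypothetical dict, not ecco's actual code): indexing a plain dict with a missing key raises the KeyError, while .get() with a default would not:

import json

# hypothetical user-supplied config without a 'tokenizer_config' entry
config = {'token_prefix': '', 'partial_token_prefix': ''}

try:
    json.dumps(config['tokenizer_config'])      # direct indexing raises on a missing key
except KeyError as err:
    print(err)                                  # -> 'tokenizer_config'

print(json.dumps(config.get('tokenizer_config', {})))  # defensive alternative, prints {}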

Or could this issue be related in some strange way to #65?

guustfranssensEY commented 2 years ago

After checking the config of a supported model, e.g. bert-base-uncased:

import ecco

lm = ecco.from_pretrained('bert-base-uncased', activations=True)
lm.model_config

{'activations': ['\\d+\\.output\\.dense'],
 'embedding': 'embeddings.word_embeddings',
 'partial_token_prefix': '##',
 'token_prefix': '',
 'tokenizer_config': {'partial_token_prefix': '##', 'token_prefix': ''},
 'type': 'mlm'}

I found that I had to add the following tokenizer_config entry:

'tokenizer_config': {'partial_token_prefix': '', 'token_prefix': ''}
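Equivalently, since those values just mirror the prefixes already in the config, an existing config dict can be patched like this (my own sketch, not an ecco API):

# derive tokenizer_config from the prefixes already present in model_config
model_config['tokenizer_config'] = {
    'token_prefix': model_config['token_prefix'],
    'partial_token_prefix': model_config['partial_token_prefix'],
}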

With that addition, my full config for the custom model is now:

model_config = {
    'embedding': 'roberta.embeddings.word_embeddings',
    'type': 'mlm',
    'activations': [r'intermediate\.dense'],
    'token_prefix': '',
    'partial_token_prefix': '',
    'tokenizer_config': {'partial_token_prefix': '', 'token_prefix': ''},
}

After fixing this, my code is able to produce the beautiful visuals @jalammar has made :)

P.S. Could the tokenizer_config be added to the documentation?

jalammar commented 2 years ago

Awesome! Thanks for working through this, @guustfranssensEY. The intent was for 'tokenizer_config' to be generated automatically by the library (so users don't have to repeat themselves needlessly). Nice catch finding out that it doesn't kick in when users supply their own config object.
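As a workaround until then, something like this sketch could derive the entry from a user-supplied config (assuming the keys shown in this thread; this is not the actual ecco implementation):

def ensure_tokenizer_config(model_config: dict) -> dict:
    """Fill in 'tokenizer_config' from the prefixes the user already supplied."""
    config = dict(model_config)  # don't mutate the caller's dict
    config.setdefault('tokenizer_config', {
        'token_prefix': config.get('token_prefix', ''),
        'partial_token_prefix': config.get('partial_token_prefix', ''),
    })
    return config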

I think the next step is to remove tokenizer_config altogether. I've opened issue #67 to track this.