instadeepai / nucleotide-transformer

🧬 Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2

Embeddings and hidden states of Agro-NT model (New to this field so please excuse my question if it is really naive/stupid.) #70

Closed NikeeShrestha closed 2 months ago

NikeeShrestha commented 4 months ago

How do the embedding layers saved from the inference notebook in this GitHub repo differ from the hidden states returned in the Hugging Face inference notebook? When I compare these two outputs for the same sequence, they are different. If I want to do a downstream classification task, which one is best to work with? They both have the same dimensions.

GitHub inference model loading and usage:

```python
parameters, forward_fn, tokenizer, config = get_pretrained_model(
    model_name=model_name,
    embeddings_layers_to_save=(20,),
    attention_maps_to_save=((1, 4), (7, 18)),
    max_positions=26,
    output_hidden_states=True,
    # If the progress bar gets stuck at the start of the model weights download,
    # you can set verbose=False to download without the progress bar.
    verbose=True,
)
```

Hugging Face inference:

```python
outs = agro_nt_model(
    torch_batch_tokens,
    attention_mask=attention_mask,
    encoder_attention_mask=attention_mask,
    output_hidden_states=True,
)
```

dallatt commented 2 months ago

Hello @NikeeShrestha

What is referred to as embeddings in the output of the model on this GitHub repo is strictly equivalent to the hidden_states in the output of the model on Hugging Face. Hugging Face returns the embeddings coming out of every transformer block, while here you specify the ones you want.

If you compare the last hidden state out of the HF model with the embedding from the 40th layer of the agro_nt model, you should get the same value!
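Here is a minimal sketch of that check (reusing `outs` from your Hugging Face call above; `jax_outs` is a placeholder for the outputs of this repo's forward pass, assuming the model was loaded with `embeddings_layers_to_save=(40,)` so that an `embeddings_40` key is present):

```python
import numpy as np

# Hugging Face side: hidden_states[-1] is the output of the final (40th) transformer block
hf_last_hidden = outs.hidden_states[-1].detach().numpy()

# Jax side: the outputs dict exposes the layers requested at load time,
# e.g. embeddings_layers_to_save=(40,) -> key "embeddings_40"
# (jax_outs is a placeholder name for the forward pass outputs)
jax_embedding = np.asarray(jax_outs["embeddings_40"])

# Up to numerical precision, the two representations should match
np.testing.assert_allclose(hf_last_hidden, jax_embedding, atol=1e-4)
```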

Do not hesitate if you have any other questions :)

hongruhu commented 2 months ago

Hi, just wanted to ask a follow-up question:

I was wondering, when using the hidden_states from Hugging Face models, would the embedding be the output of the last layer?

For example: (1) for the 500M human model, the embedding should be output['hidden_states'][-1] (the 25th entry, counting the initial embedding layer) with the shape [batch_size, max_length, 1280]; (2) for the 2.5B multi-species model, it should be output['hidden_states'][-1] (the 33rd entry) with the shape [batch_size, max_length, 2560].
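For reference, this is the quick check I have in mind (a sketch, assuming `outs` is the Hugging Face output of an inference call like the one above, for the 500M human model):

```python
# output_hidden_states=True makes the model return one tensor per transformer
# block, plus one for the initial token embeddings
hidden_states = outs.hidden_states
print(len(hidden_states))       # 25 = 1 embedding layer + 24 transformer blocks
print(hidden_states[-1].shape)  # torch.Size([batch_size, max_length, 1280])
```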

If so, I was wondering why the README on this repo's main page uses `20` for the 500M human model:

```python
import haiku as hk
from nucleotide_transformer.pretrained import get_pretrained_model

# Get pretrained model
parameters, forward_fn, tokenizer, config = get_pretrained_model(
    model_name="500M_human_ref",
    embeddings_layers_to_save=(20,),
    max_positions=32,
)
forward_fn = hk.transform(forward_fn)
```

For the embedding, should we use the [max_length, 1280] 2-D matrix as each sequence's embedding, or should we average over the max_length dimension so that each sequence's embedding becomes a 1280-element vector?

dallatt commented 2 months ago

Hello @hongruhu ,

The 20th layer is an arbitrary choice, as embeddings from intermediate layers can also be interesting to use. Indeed, if you want the final embedding layer of the 500M human model, since there are 24 layers, you should use embeddings_layers_to_save=(24,).
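Concretely, a sketch adapting the README snippet (the `tokens` array is assumed to come from the tokenizer, and the output key follows this repo's `embeddings_{layer}` naming):

```python
import haiku as hk
import jax
from nucleotide_transformer.pretrained import get_pretrained_model

parameters, forward_fn, tokenizer, config = get_pretrained_model(
    model_name="500M_human_ref",
    embeddings_layers_to_save=(24,),  # final transformer block
    max_positions=32,
)
forward_fn = hk.transform(forward_fn)

# tokens: [batch_size, max_positions] int array produced by the tokenizer
outs = forward_fn.apply(parameters, jax.random.PRNGKey(0), tokens)
final_embeddings = outs["embeddings_24"]  # [batch_size, max_positions, 1280]
```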

As for the representation, a very common practice is to average the embeddings of the tokens across the sequence-length dimension! You can find an example of this in the example notebook here.
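On the Hugging Face side, a minimal mean-pooling sketch would look like this (reusing `outs` and `attention_mask` from the call above; masking out padding tokens before averaging is an assumption on my part, but it is the usual way to do it):

```python
# hidden: [batch_size, max_length, 1280] final-layer embeddings
hidden = outs.hidden_states[-1]

# attention_mask: [batch_size, max_length], 1 for real tokens, 0 for padding
mask = attention_mask.unsqueeze(-1).to(hidden.dtype)

# Average only over real tokens -> one 1280-d vector per sequence
mean_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```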

Best regards, Hugo