inseq-team / inseq

Interpretability for sequence generation models 🐛 🔍
https://inseq.org
Apache License 2.0

Long input document is truncated when calculating attribution scores #260

Closed acDante closed 8 months ago

acDante commented 8 months ago

Question

Hi, thanks for your great work! I am trying to run the attribution function on a long input document, but the processed document is truncated according to out[0].source. Does any argument in this function limit the maximum input length in tokens (e.g. something like max_input_length)? How should I change my code so that I can get attribution scores for the whole document?

Here is the relevant code snippet for reproducing this problem:

import datasets
import inseq

# Model loading was not shown in the original snippet; flan-t5-base with a
# step-based method such as integrated_gradients (per the discussion below) is assumed.
model = inseq.load_model("google/flan-t5-base", "integrated_gradients")

# Build a single long document from one test article of the extractive CNN/DailyMail dataset.
extra_cnn_data = datasets.load_dataset("eReverter/cnn_dailymail_extractive", split="test")
article = extra_cnn_data[1]["src"]
doc = " ".join(article)

out = model.attribute(
    doc,
    n_steps=100,
    generation_args={"max_new_tokens": 64},
    internal_batch_size=50,
)
gsarti commented 8 months ago

Hi @acDante, could you also share the model type you are using? Some models have a fixed maximum input size (e.g. for GPT-2, n_ctx is 1024 tokens: https://huggingface.co/openai-community/gpt2/blob/main/config.json). On our side, we simply apply padding and truncation according to the model's tokenizer, so it is likely a limitation of the model itself. Have you tried loading the model tokenizer with transformers and tokenizing your inputs with tokenizer(inputs, padding=True, truncation=True)? I think it is likely to produce the same issue.
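
For reference, a minimal sketch of that check, assuming the flan-t5-base checkpoint mentioned in the next comment and the doc variable from the snippet above:

from transformers import AutoTokenizer

# Tokenize the same document with the model's own tokenizer to see whether
# truncation already happens outside of inseq (checkpoint is an assumption
# taken from the follow-up comment below).
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
encoded = tokenizer(doc, padding=True, truncation=True, return_tensors="pt")
print(tokenizer.model_max_length)      # 512 for flan-t5-base
print(encoded["input_ids"].shape)      # sequence length is capped at model_max_length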

acDante commented 8 months ago

Thanks for your prompt reply! Yes, it is a limitation of the model. I am using the flan-t5-base model, whose default context window is 512 tokens. Is it possible to set a larger tokenizer.model_max_length or to set tokenizer.truncation=False when loading the model with inseq.load_model()?

gsarti commented 8 months ago

If the model is limited to 512 tokens, you most likely don't want to change that: it was trained with that limit, so it would likely produce garbage beyond that size. Modern approaches use position-encoding methods like ALiBi and RoPE to overcome that limitation, but Flan-T5 does not support them AFAIK. You might want to use another model in this context.
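
A minimal sketch of that suggestion, assuming a long-context encoder-decoder such as LongT5 (not named in the thread) and the same attribution settings as the snippet above:

import inseq

# Hypothetical swap (model choice is an assumption): LongT5 uses relative position
# encodings and accepts much longer inputs than flan-t5-base.
model = inseq.load_model("google/long-t5-tglobal-base", "integrated_gradients")

out = model.attribute(
    doc,  # the full article built in the reproduction snippet above
    n_steps=100,
    generation_args={"max_new_tokens": 64},
    internal_batch_size=50,
)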