huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Should group_text in run_clm.py separate documents with special tokens? #23198

Closed: verdimrc closed this issue 1 year ago

verdimrc commented 1 year ago

System Info

Who can help?

@sgugger

Information

Tasks

Reproduction

I observe that when running run_clm.py with the GPT-J tokenizer, group_texts() doesn't separate different "documents" with a special token (for the GPT-J tokenizer, eos = bos = padding). Is this something I need to handle myself?

Snippet from run_clm.py:

from datasets import load_dataset

def tokenize_function(examples, text_column_name="text"):
    ...

# Main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    ...

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:5]")
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=list(raw_datasets.features),
)

block_size = 8
lm_datasets = tokenized_datasets.map(group_texts, batched=True)
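
For reference, the elided group_texts in run_clm.py just concatenates all tokenized texts and slices the result into block_size chunks; nothing is inserted between documents. A paraphrased sketch (it may differ slightly from the exact script code):

from itertools import chain

def group_texts(examples):
    # Concatenate every tokenized column (input_ids, attention_mask, ...).
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder at the end.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size; no separator token is ever added.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result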

Inspecting lm_datasets shows the following:

>>> print(raw_datasets['text'])
['', ' = Valkyria Chronicles III = \n', '', ' Senjō no Valkyria ...', ...]

>>> print(tokenized_datasets['input_ids'])
[[], [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198], [], [2311, 73, 13090, 645, 569, 18354, 7496, ...], ...]

>>> print(lm_datasets['input_ids'])
[[796, 569, 18354, 7496, 17740, 6711, 796, 220], [198, 2311, 73, 13090, 645, 569, 18354, 7496], ...]

As shown above, there is no eos or sep token anywhere in lm_datasets (the GPT-J tokenizer uses <|endoftext|>, token id 50256, for both).
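
A quick check confirms the separator token is absent (assuming the GPT-J tokenizer object is available as tokenizer):

>>> eos_id = tokenizer.eos_token_id   # 50256, i.e. <|endoftext|>
>>> any(eos_id in block for block in lm_datasets['input_ids'])
False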

Expected behavior

My understanding from the official tutorial (link) is that different documents should be separated with a special token.

sgugger commented 1 year ago

It shows one basic way of preprocessing the data. It's up to you to customize it to your dataset and your needs :-)
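
For example, one minimal tweak (a sketch, not from the script or this thread; it assumes tokenizer and text_column_name are defined as in run_clm.py) is to append the EOS token to every document at tokenization time, since GPT-2-style tokenizers don't add it on their own. group_texts will then keep a separator between documents:

    def tokenize_function(examples):
        # Append <|endoftext|> to each document before group_texts concatenates them.
        return tokenizer([t + tokenizer.eos_token for t in examples[text_column_name]])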

verdimrc commented 1 year ago

Got it. Thank you @sgugger for the explanation.

liaocs2008 commented 1 year ago

I had a similar confusion until I found this post.

This is how I addressed the issue:

    def tokenize_function(examples):
        # Pad / truncate each document to exactly block_size, so every example
        # becomes its own block and documents are never concatenated together.
        assert tokenizer.pad_token is not None

        with CaptureLogger(tok_logger) as cl:
            output = tokenizer(
                examples[text_column_name],
                truncation=True, 
                max_length=block_size,
                padding="max_length",
            )
        # clm input could be much much longer than block_size
        if "Token indices sequence length is longer than the" in cl.out:
            tok_logger.warning(
                "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
                " before being passed to the model."
            )
        return output
eryk-mazus commented 4 months ago

my take:


    def group_texts(examples):
        # Concatenate input_ids with EOS token and adjust attention_mask accordingly.
        concatenated_input_ids = list(
            chain(*[example + [eos_token_id] for example in examples["input_ids"]])
        )
        concatenated_attention_mask = list(
            chain(*[example + [1] for example in examples["attention_mask"]])
        )

        total_length = len(concatenated_input_ids)
        total_length = (total_length // block_size) * block_size

        # Split by chunks of block_size.
        result = {
            "input_ids": [
                concatenated_input_ids[i : i + block_size]
                for i in range(0, total_length, block_size)
            ],
            "attention_mask": [
                concatenated_attention_mask[i : i + block_size]
                for i in range(0, total_length, block_size)
            ],
        }
        result["labels"] = result["input_ids"].copy()
        return result
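
This assumes chain (from itertools), eos_token_id, and block_size are already in scope, e.g. (a sketch):

    from itertools import chain

    eos_token_id = tokenizer.eos_token_id  # 50256 (<|endoftext|>) for GPT-J
    block_size = 1024  # or whatever block size is used for training
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)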

I can create a PR if you want.