hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

Question about text preprocess in examples/language/llama2 and applications/Colossal-LLaMA-2 #5026

Open fancyerii opened 1 year ago

fancyerii commented 1 year ago

Describe the feature

I found that both examples truncate text longer than max_length, so we have to segment long text into shorter pieces ourselves. For examples/language/llama2, the code is:

def tokenize_batch_for_pretrain(batch, tokenizer: Optional[LlamaTokenizer] = None, max_length: int = 2048):
    texts = [sample["text"] for sample in batch]
    data = tokenizer(texts, return_tensors="pt", padding="max_length", truncation=True, max_length=max_length)
    data = {k: v.cuda() for k, v in data.items()}
    data["labels"] = data["input_ids"].clone()
    return data

dataset = load_dataset(args.dataset)
train_ds = dataset["train"]
dataloader = prepare_dataloader(
    train_ds,
    batch_size=args.batch_size,
    shuffle=True,
    drop_last=True,
    collate_fn=partial(tokenize_batch_for_pretrain, tokenizer=tokenizer, max_length=args.max_length),
)

It's clear that any text longer than max_length (2048/4096 tokens) will be truncated, and the default RedPajama dataset contains very long documents.
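For example, a quick check (my own illustration, not from the repo; the tokenizer checkpoint name is just a placeholder) shows that everything beyond max_length is silently dropped:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")  # placeholder checkpoint
long_text = "word " * 10_000  # a document far longer than max_length

data = tokenizer([long_text], return_tensors="pt", padding="max_length", truncation=True, max_length=2048)
print(data["input_ids"].shape)  # torch.Size([1, 2048]); the remaining tokens are discarded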

The code for applications/Colossal-LLaMA-2:

        dataset = dataset.map(
            function=supervised_tokenize,
            fn_kwargs={"tokenizer": tokenizer, "max_length": args.max_length},
            keep_in_memory=False,
            num_proc=min(len(dataset), cpu_count()),
        )

def supervised_tokenize(
    data_point: Dict[str, str], tokenizer: LlamaTokenizer, ignore_index: int = None, max_length: int = 4096
) -> Dict[str, Union[int, str, List[int]]]:
    """
    A tokenization function to tokenize an original pretraining data point as following:
        {"source": "", "target": "Beijing, the capital of the People's Republic of China, ...", "category": "geography"}
    """
    assert tokenizer.add_bos_token is False and tokenizer.add_eos_token is False, (
        "Initially set `tokenizer.add_bos_token` and `tokenizer.add_eos_token` to False, "
        "add <bos> and <eos> manually later"
    )
    if ignore_index is None:
        ignore_index = IGNORE_INDEX

    source_text = data_point["source"]  # `str`
    target_text = data_point["target"]  # `str`
    is_null_source = len(source_text) == 0

    source_text = tokenizer.bos_token + source_text
    target_text += tokenizer.eos_token
    sequence_text = source_text + target_text

    tokenized = tokenizer([source_text, sequence_text])["input_ids"]
    sequence_input_ids = tokenized[1]
    sequence_labels = deepcopy(sequence_input_ids)

    source_length = len(tokenized[0])
    if not is_null_source:
        sequence_labels[:source_length] = [ignore_index for _ in range(source_length)]

    # sequence truncation.
    if len(sequence_input_ids) > max_length:
        sequence_input_ids = sequence_input_ids[:max_length]
        sequence_labels = sequence_labels[:max_length]

    return dict(
        input_ids=sequence_input_ids,
        labels=sequence_labels,
        seq_length=len(sequence_input_ids),
        seq_category=data_point["category"],
    )

Here, too, the sequence is truncated by the post-processing code.

Orion-Zheng commented 1 year ago

Yes. 😃 Our thinking is that documents can be segmented in advance according to some rules. For instance, books (long documents) in RedPajama can be split into chapters/paragraphs before tokenization. I also see your point: you think we should implement a packing dataset that automatically packs short samples and splits long ones during tokenization. I think both are practical implementations.
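For reference, a minimal sketch of the packing/splitting option (my own sketch, not from the repo; it assumes a Hugging Face datasets.Dataset with a "text" column, and names like group_texts and block_size are purely illustrative):

from itertools import chain

def group_texts(examples, tokenizer=None, block_size=2048):
    # Tokenize each document, concatenate all token ids into one long stream,
    # then cut the stream into fixed-size blocks so long documents are split
    # and short ones are packed together instead of being truncated/padded.
    tokenized = tokenizer(examples["text"])["input_ids"]
    concatenated = list(chain.from_iterable(tokenized))
    total_length = (len(concatenated) // block_size) * block_size  # drop the final partial block
    input_ids = [concatenated[i : i + block_size] for i in range(0, total_length, block_size)]
    return {"input_ids": input_ids, "labels": [ids.copy() for ids in input_ids]}

# Applied with a batched map, removing the original columns since the number of rows changes:
# dataset = dataset.map(group_texts, batched=True, remove_columns=dataset.column_names,
#                       fn_kwargs={"tokenizer": tokenizer, "block_size": 2048})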

fancyerii commented 1 year ago

Yes, and I think it should be documented clearly that users must segment their inputs themselves, or else their data will be truncated.