Closed: verdimrc closed this issue 1 year ago
It shows one basic data-preprocessing example. It's up to you to customize it to your dataset and your needs :-)
Got it. Thank you @sgugger for the explanation.
I had a similar confusion until I found this post.

This is how I address the issue:
```python
def tokenize_function(examples):
    assert tokenizer.pad_token is not None
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(
            examples[text_column_name],
            truncation=True,
            max_length=block_size,
            padding="max_length",
        )
    # clm input could be much much longer than block_size
    if "Token indices sequence length is longer than the" in cl.out:
        tok_logger.warning(
            "^^^^^^^^^^^^^^^^ Please ignore the warning above - this long input will be chunked into smaller bits"
            " before being passed to the model."
        )
    return output
```
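For completeness, here is roughly how it plugs into the script's pipeline; the `map()` arguments below follow what I remember from `run_clm.py`, so treat them as a sketch rather than the exact call:

```python
# Sketch of the map() wiring, roughly as in run_clm.py. With padding="max_length"
# every example already has length block_size, so the grouping step can be skipped.
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```

One caveat: with this approach the pad positions end up in the labels unless the collator masks them; if I remember correctly, `DataCollatorForLanguageModeling(tokenizer, mlm=False)` replaces pad token ids with -100 in the labels, and since pad == eos for GPT-J, that also masks genuine end-of-text tokens.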
my take:
```python
from itertools import chain


def group_texts(examples):
    # Concatenate input_ids with the EOS token and adjust attention_mask accordingly.
    concatenated_input_ids = list(
        chain(*[example + [eos_token_id] for example in examples["input_ids"]])
    )
    concatenated_attention_mask = list(
        chain(*[example + [1] for example in examples["attention_mask"]])
    )
    total_length = len(concatenated_input_ids)
    total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        "input_ids": [
            concatenated_input_ids[i : i + block_size]
            for i in range(0, total_length, block_size)
        ],
        "attention_mask": [
            concatenated_attention_mask[i : i + block_size]
            for i in range(0, total_length, block_size)
        ],
    }
    result["labels"] = result["input_ids"].copy()
    return result
```
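A quick toy check of the version above; the values are made up purely for illustration, and `block_size` / `eos_token_id` are the module-level names the function expects:

```python
# Hypothetical toy check of the modified group_texts() above.
block_size = 4
eos_token_id = 50256  # GPT-J's <|endoftext|>

batch = {
    "input_ids": [[11, 12, 13], [21, 22]],
    "attention_mask": [[1, 1, 1], [1, 1]],
}
out = group_texts(batch)
print(out["input_ids"])
# [[11, 12, 13, 50256]] -- the EOS separator now follows the first document;
# the remainder [21, 22, 50256] is shorter than block_size and gets dropped.
```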
I can create a PR if you want.
System Info
Who can help?
@sgugger
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
I observe that when running `run_clm.py` with the GPT-J tokenizer, `group_texts()` doesn't separate different "documents" with a special token (for the GPT-J tokenizer, eos = bos = padding). Is this something I need to handle myself?

Snippet from `run_clm.py`:
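For reference, the function looks roughly like this (paraphrased from the script as I recall it, so exact comments and edge cases may differ; `chain` and `block_size` come from the surrounding script):

```python
# Grouping logic from run_clm.py (paraphrased). Note that it simply concatenates
# whatever the tokenizer produced, without inserting any separator token.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder that doesn't fill a full block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```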
Inspecting `lm_datasets` shows that there is no eos or sep token (the GPT-J tokenizer uses `<|endoftext|>`, aka 50256, for both) anywhere in `lm_datasets`.
Expected behavior
My understanding from the official tutorial (link) is that different documents should be separated with a special token.
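For reference, one way to get that behaviour, assuming the names from `run_clm.py` (`tokenizer`, `text_column_name`), would be to append the eos token per document at tokenization time. This is just a sketch of the idea, not what the script currently does:

```python
def tokenize_function(examples):
    output = tokenizer(examples[text_column_name])
    # Append <|endoftext|> to every document so that group_texts()' concatenation
    # keeps a separator between documents (assumption about the intended fix).
    output["input_ids"] = [ids + [tokenizer.eos_token_id] for ids in output["input_ids"]]
    output["attention_mask"] = [mask + [1] for mask in output["attention_mask"]]
    return output
```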