Open siddhsql opened 1 year ago
I just noticed the docs say:

> If `batched` is `True` and `batch_size` is `n > 1`, then the function takes a batch of `n` examples as input and can return a batch with `n` examples, or with an arbitrary number of examples.

so maybe this is a bug then.
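The documented behavior can be sketched without any libraries: a batched function receives `n` rows and may return a different number of rows. The `chunk_batch` helper below is hypothetical, not part of the `datasets` API:

```python
# Pure-Python sketch of a batched map function whose output row count
# differs from its input row count (which is what the docs permit).
def chunk_batch(batch, window=4):
    # batch is a dict of column name -> list of values
    out = {"chunk": []}
    for text in batch["text"]:
        # split each input row into fixed-size character windows,
        # so one input row can fan out into several output rows
        for i in range(0, len(text), window):
            out["chunk"].append(text[i:i + window])
    return out

result = chunk_batch({"text": ["abcdefgh", "xy"]})
# 2 input rows become 3 output rows
```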
All the values in a batch must be of the same length, so one solution is dropping all the input columns:

```python
data = data.map(
    lambda samples: tokenizer(
        samples["text"],
        max_length=tokenizer.model_max_length,
        truncation=True,
        stride=4,
        return_overflowing_tokens=True,
    ),
    batched=True,
    remove_columns=data.column_names,
)
```

Another is padding/transforming the input columns to the tokenizer output's length (447).
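Why dropping the input columns helps can be seen in a minimal pure-Python sketch. The `apply_map` helper below only loosely imitates how `datasets.map` merges the function's output with the surviving input columns; it is not the real implementation:

```python
def apply_map(dataset, fn, remove_columns=()):
    # dataset: dict of column name -> list of rows.
    # Loosely imitates datasets.map: the function's output columns are
    # merged with whatever input columns were not removed.
    out = fn(dict(dataset))
    kept = {k: v for k, v in dataset.items() if k not in remove_columns}
    merged = {**kept, **out}
    if len({len(v) for v in merged.values()}) > 1:
        # this is the length-mismatch situation from the issue
        raise ValueError("columns have different lengths")
    return merged

def overflow_fn(batch):
    # one output row per 3-character chunk, so rows fan out
    ids = [t[i:i + 3] for t in batch["text"] for i in range(0, len(t), 3)]
    return {"input_ids": ids}

data = {"text": ["abcdef", "gh"]}  # 2 rows in, 3 rows out
# apply_map(data, overflow_fn) raises ValueError: "text" keeps 2 rows
# while "input_ids" has 3. Dropping the input column avoids the clash:
fixed = apply_map(data, overflow_fn, remove_columns=("text",))
```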
Feature request
I understand `dataset` provides a `map` function. This function in turn takes a callable that is used to tokenize the text on which a model is trained. Frequently this text will not fit within a model's context window. In this case it would be useful to wrap the text around into multiple rows, with each row fitting the model's context window. I tried to do it using this code as an example, which in turn I have borrowed from here: but running the code gives me this error:
The lambda function I have provided is correctly chopping up long text so it wraps around (and because of this, 394 samples become 447 after wrap-around), but the dataset `map` function does not like it.

Motivation
Please see above.
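The wrap-around described above (overlapping windows with a stride, as `return_overflowing_tokens` with `stride=4` produces) can be sketched in plain Python. The `sliding_chunks` helper is hypothetical and works on a token list rather than a real tokenizer:

```python
def sliding_chunks(tokens, max_length, stride):
    # Overlapping windows: each window repeats the last `stride`
    # tokens of the previous one, mimicking return_overflowing_tokens.
    chunks = []
    step = max_length - stride
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
        start += step
    return chunks

chunks = sliding_chunks(list(range(10)), max_length=6, stride=4)
# one long row becomes 3 overlapping rows; fan-out like this is why
# 394 samples can become 447 after wrap-around
```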
Your contribution
I'm afraid I don't have much knowledge to help.