Open · alex-ber opened this issue 1 month ago
@SunMarc is there a reason why `get_wikitext2` is different from the other methods?
Not sure. This was something TheBloke coded back then. Maybe this is because `data[i]["text"]` is pretty long, so it takes a while to find a text < `seqlen`?
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
This does not happen, as we slice the tokenized data afterwards:
# Pick a random start index and slice a seqlen-token window out of the
# already-tokenized corpus; the over-long tensor itself is never fed to the model.
i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
j = i + seqlen
inp = enc.input_ids[:, i:j]
attention_mask = torch.ones_like(inp)
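For contrast, the c4-style loaders avoid tokenizing the whole corpus up front: each calibration sample tokenizes one randomly drawn document and retries until it is long enough. A simplified sketch of that strategy (not the exact optimum code):

```python
# Simplified c4-style sampling: tokenize one random document per sample and
# retry until it is long enough, so the full corpus is never tokenized at once.
import random

import torch


def sample_one(tokenizer, data, seqlen):
    """Draw one (input_ids, attention_mask) calibration sample, c4-style."""
    while True:
        k = random.randint(0, len(data) - 1)
        enc = tokenizer(data[k]["text"], return_tensors="pt")
        if enc.input_ids.shape[1] > seqlen:
            break
    i = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
    inp = enc.input_ids[:, i : i + seqlen]
    return inp, torch.ones_like(inp)
```

The retry loop is presumably why the joined-corpus approach was kept for wikitext-2, whose individual rows are usually much shorter than `seqlen`.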
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction (minimal, reproducible, runnable)
Produces the following warning:
Token indices sequence length is longer than the specified maximum sequence length for this model (73218 > 2048). Running this sequence through the model will result in indexing errors
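For reference, a minimal way to trigger it, assuming the calibration loaders live in `optimum.gptq.data` (the import path and exact signature may differ between versions):

```python
# Minimal trigger for the warning (import path and signature may vary by version).
from optimum.gptq.data import get_wikitext2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# get_wikitext2 joins the whole wikitext-2 split and tokenizes it in one call,
# so the tokenizer warns about exceeding model_max_length before any slicing.
dataset = get_wikitext2(tokenizer, seqlen=512, nsamples=8)
```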
Expected behavior
This is the proposed fix:
Inspired by `get_c4` and `get_c4_new`. No warning is produced.
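The actual diff is not reproduced here. As a rough illustration of the c4-inspired direction, one way to build the calibration samples without tokenizing the whole corpus in a single call could look like this (hypothetical helper name, not the submitted patch):

```python
# Rough illustration only, not the submitted patch. Each sample is built from a
# bounded chunk of consecutive wikitext-2 rows and tokenized with truncation,
# so the tokenizer never sees (or warns about) an over-long sequence.
import random

import torch
from datasets import load_dataset


def get_wikitext2_no_warning(tokenizer, seqlen, nsamples, split="train"):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split=split)
    dataset = []
    while len(dataset) < nsamples:
        # Start at a random row and append consecutive rows until the chunk is
        # comfortably longer than seqlen tokens (rough character heuristic).
        start = random.randint(0, len(data) - 1)
        chunk, row = "", start
        while len(chunk) < seqlen * 8 and row < len(data):
            chunk += data[row]["text"] + "\n"
            row += 1
        enc = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=seqlen)
        if enc.input_ids.shape[1] < seqlen:
            continue  # hit the end of the corpus; draw a new starting row
        dataset.append(
            {"input_ids": enc.input_ids, "attention_mask": torch.ones_like(enc.input_ids)}
        )
    return dataset
```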