nokitoino opened 10 months ago
The tokenizer is not meant to deal with C code or assembly.
```python
encoding_train = tokenizer(
    input_train,
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
)
input_ids_train, attention_mask_train = (
    encoding_train.input_ids.to(device),
    encoding_train.attention_mask.to(device),
)
```
We run into CUDA out-of-memory (OOM) issues on a dataset of about 60 MB. Eventually we have to tokenize it chunk by chunk. This is called on-the-fly tokenization when the dataset is very large; see https://huggingface.co/blog/how-to-train. A sketch of this approach is shown below.
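A minimal sketch of on-the-fly tokenization, assuming a PyTorch `DataLoader` with a custom `collate_fn`: instead of tokenizing the whole corpus up front as in the snippet above, each batch of raw strings is tokenized just before it is fed to the model, so padding applies only to the longest sequence in that batch. The checkpoint name, `max_source_length` value, and the placeholder `input_train` list are illustrative assumptions, not taken from the issue.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")  # assumed checkpoint
max_source_length = 512  # assumed value

class RawTextDataset(Dataset):
    """Holds the raw C/assembly strings; no tokenization happens here."""
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx]

def collate(batch):
    # Tokenize only the current batch ("on the fly"), padding to the
    # longest sequence in this batch rather than in the whole dataset.
    enc = tokenizer(
        batch,
        padding="longest",
        max_length=max_source_length,
        truncation=True,
        return_tensors="pt",
    )
    return enc.input_ids, enc.attention_mask

input_train = ["int main(void) { return 0; }"]  # placeholder for the real corpus
loader = DataLoader(RawTextDataset(input_train), batch_size=8, collate_fn=collate)

for input_ids, attention_mask in loader:
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)
    # ... forward/backward pass for this chunk only
```

This keeps only one batch of token tensors on the GPU at a time, which avoids materializing the encoded 60 MB corpus in memory at once.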