nokitoino / DecompilerAI

Converting Assembly back to C Code using Transformers.
GNU General Public License v3.0

MultiGPULONGT5AssemblyC.ipynb #5

Open · nokitoino opened this issue 10 months ago

nokitoino commented 10 months ago

The tokenizer is not designed to handle C code or assembly.

  1. Eventually switch to a byte-level BPE (Byte-Pair Encoding) tokenizer and train it from scratch; see the training sketch at the end of this comment.
  2. Coverage metrics are needed: does the current tokenizer produce unknown tokens on our data? A minimal check is sketched at the end of this comment.
  3. We tokenize the whole dataset at once, which consumes too much GPU RAM:
# From the notebook: the entire training set is tokenized in a single
# call; `tokenizer`, `input_train`, `max_source_length`, and `device`
# are defined earlier in the notebook.
encoding_train = tokenizer(
    input_train,
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
)
# Moving every tensor to the GPU at once is what exhausts GPU memory.
input_ids_train = encoding_train.input_ids.to(device)
attention_mask_train = encoding_train.attention_mask.to(device)

We run into CUDA out-of-memory (OOM) errors on a dataset of about 60 MB. We will eventually have to tokenize chunk by chunk; for large datasets this is called on-the-fly tokenization, see https://huggingface.co/blog/how-to-train. A minimal sketch follows.
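One way to avoid the OOM is to tokenize per sample inside a PyTorch Dataset, so that only one batch at a time is ever built and moved to the GPU. This is a sketch, not the notebook's actual code: it assumes input_train and a list of C targets (called target_train here, a name introduced for illustration) are lists of strings, and that tokenizer, max_source_length, and device are defined as in the notebook; max_target_length is likewise a placeholder:

from torch.utils.data import Dataset, DataLoader

class AssemblyCDataset(Dataset):
    """Tokenizes one (assembly, C) pair at a time instead of the whole corpus."""

    def __init__(self, sources, targets, tokenizer, max_source_length, max_target_length):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        # Tokenization happens lazily, per sample; fixed-length padding
        # keeps the default collate function happy.
        enc = self.tokenizer(
            self.sources[idx],
            padding="max_length",
            max_length=self.max_source_length,
            truncation=True,
            return_tensors="pt",
        )
        labels = self.tokenizer(
            self.targets[idx],
            padding="max_length",
            max_length=self.max_target_length,
            truncation=True,
            return_tensors="pt",
        ).input_ids
        # Note: for T5-style training, pad token ids in labels are
        # usually replaced with -100 so the loss ignores padding.
        return {
            "input_ids": enc.input_ids.squeeze(0),
            "attention_mask": enc.attention_mask.squeeze(0),
            "labels": labels.squeeze(0),
        }

train_loader = DataLoader(
    AssemblyCDataset(input_train, target_train, tokenizer,
                     max_source_length, max_target_length),
    batch_size=8,
    shuffle=True,
)

for batch in train_loader:
    # Only the current batch lives on the GPU, never the whole dataset.
    batch = {k: v.to(device) for k, v in batch.items()}
    # ... forward/backward pass ...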
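For point 1, the tokenizers library can train a byte-level BPE tokenizer from scratch, following the approach in the Hugging Face blog post linked above. A sketch; the file names, vocabulary size, and special tokens below are assumptions, not values from this repo:

from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus files: raw assembly and C sources used for training.
files = ["train_assembly.txt", "train_c.txt"]

bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=files,
    vocab_size=32000,  # assumed; tune to the corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>"],
)

# Writes vocab.json and merges.txt for later loading.
bpe_tokenizer.save_model("decompilerai-bpe")

Because it operates on bytes, such a tokenizer can encode any input, so unknown tokens disappear by construction.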
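For point 2, a crude coverage metric is the fraction of tokens the current tokenizer maps to its unknown token. A sketch, assuming tokenizer and input_train as in the notebook (and that the tokenizer defines an unk token):

def unk_rate(texts, tokenizer):
    """Fraction of produced token ids that equal the unknown-token id."""
    unk_id = tokenizer.unk_token_id
    total = unknown = 0
    for text in texts:
        ids = tokenizer(text).input_ids  # plain Python list of token ids
        total += len(ids)
        unknown += ids.count(unk_id)
    return unknown / max(total, 1)

print(f"Unknown-token rate on the training set: {unk_rate(input_train, tokenizer):.4%}")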