nokitoino / DecompilerAI

Converting Assembly back to C Code using Transformers.
GNU General Public License v3.0

MultiGPULONGT5AssemblyC.ipynb #5

Open · nokitoino opened this issue 10 months ago

nokitoino commented 10 months ago

The tokenizer is not designed to handle C code or assembly.

  1. Eventually switch to a byte-level BPE (Byte-Pair Encoding) tokenizer and train it from scratch; see the training sketch at the end of this comment.
  2. Coverage metrics are needed: does the current tokenizer produce unknown tokens on our data? A minimal check is sketched at the end of this comment.
  3. We tokenize the whole dataset at once, which consumes too much GPU RAM:
# From the notebook: the entire training set is tokenized in a single
# call; `tokenizer`, `input_train`, `max_source_length`, and `device`
# are defined earlier in the notebook.
encoding_train = tokenizer(
    input_train,
    padding="longest",
    max_length=max_source_length,
    truncation=True,
    return_tensors="pt",
)
# Moving every tensor to the GPU at once is what exhausts GPU memory.
input_ids_train = encoding_train.input_ids.to(device)
attention_mask_train = encoding_train.attention_mask.to(device)

We run into CUDA out-of-memory (OOM) errors on a dataset of about 60 MB. We will eventually have to tokenize chunk by chunk; for large datasets this is called on-the-fly tokenization, see https://huggingface.co/blog/how-to-train. A minimal sketch follows.
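One way to avoid the OOM is to tokenize per sample inside a PyTorch Dataset, so that only one batch at a time is ever built and moved to the GPU. This is a sketch, not the notebook's actual code: it assumes input_train and a list of C targets (called target_train here, a name introduced for illustration) are lists of strings, and that tokenizer, max_source_length, and device are defined as in the notebook; max_target_length is likewise a placeholder:

from torch.utils.data import Dataset, DataLoader

class AssemblyCDataset(Dataset):
    """Tokenizes one (assembly, C) pair at a time instead of the whole corpus."""

    def __init__(self, sources, targets, tokenizer, max_source_length, max_target_length):
        self.sources = sources
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    def __len__(self):
        return len(self.sources)

    def __getitem__(self, idx):
        # Tokenization happens lazily, per sample; fixed-length padding
        # keeps the default collate function happy.
        enc = self.tokenizer(
            self.sources[idx],
            padding="max_length",
            max_length=self.max_source_length,
            truncation=True,
            return_tensors="pt",
        )
        labels = self.tokenizer(
            self.targets[idx],
            padding="max_length",
            max_length=self.max_target_length,
            truncation=True,
            return_tensors="pt",
        ).input_ids
        # Note: for T5-style training, pad token ids in labels are
        # usually replaced with -100 so the loss ignores padding.
        return {
            "input_ids": enc.input_ids.squeeze(0),
            "attention_mask": enc.attention_mask.squeeze(0),
            "labels": labels.squeeze(0),
        }

train_loader = DataLoader(
    AssemblyCDataset(input_train, target_train, tokenizer,
                     max_source_length, max_target_length),
    batch_size=8,
    shuffle=True,
)

for batch in train_loader:
    # Only the current batch lives on the GPU, never the whole dataset.
    batch = {k: v.to(device) for k, v in batch.items()}
    # ... forward/backward pass ...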
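For point 1, the tokenizers library can train a byte-level BPE tokenizer from scratch, following the approach in the Hugging Face blog post linked above. A sketch; the file names, vocabulary size, and special tokens below are assumptions, not values from this repo:

from tokenizers import ByteLevelBPETokenizer

# Hypothetical corpus files: raw assembly and C sources used for training.
files = ["train_assembly.txt", "train_c.txt"]

bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=files,
    vocab_size=32000,  # assumed; tune to the corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>"],
)

# Writes vocab.json and merges.txt for later loading.
bpe_tokenizer.save_model("decompilerai-bpe")

Because it operates on bytes, such a tokenizer can encode any input, so unknown tokens disappear by construction.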
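For point 2, a crude coverage metric is the fraction of tokens the current tokenizer maps to its unknown token. A sketch, assuming tokenizer and input_train as in the notebook (and that the tokenizer defines an unk token):

def unk_rate(texts, tokenizer):
    """Fraction of produced token ids that equal the unknown-token id."""
    unk_id = tokenizer.unk_token_id
    total = unknown = 0
    for text in texts:
        ids = tokenizer(text).input_ids  # plain Python list of token ids
        total += len(ids)
        unknown += ids.count(unk_id)
    return unknown / max(total, 1)

print(f"Unknown-token rate on the training set: {unk_rate(input_train, tokenizer):.4%}")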