dbl001 opened this issue 1 year ago
Dude, you are on a single A100; you need more to scale. You've looked at every parameter other than model size.
Also, do you have a link to the dataset? It might need filtering for metadata removal and formatting.
A 5-million-parameter model is below toy level; I'm surprised you get a loss of 5 on scientific papers.
Let's look at the examples you've cited. For BioBERT I couldn't find concrete information on parameter count, but its distilled version, CompactBioBERT (https://huggingface.co/nlpie/compact-biobert), is a 65M model, 13x bigger. Comparing the sizes of the model weights (both are from the fp32 era), the full model is roughly twice as large as the distilled one, so approximately a 130M model, 26x bigger than your current example.
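If you want to sanity-check that estimate, a rough rule I use is that an fp32 checkpoint stores 4 bytes per parameter, so parameter count is roughly file size divided by 4 (the file sizes below are assumptions for illustration, not numbers pulled from the hub):

# Back-of-the-envelope: fp32 stores 4 bytes per parameter, so
# params ~ checkpoint_size_bytes / 4 (ignoring optimizer state and metadata).
def params_from_fp32_size(size_mb: float) -> float:
    return size_mb * 1e6 / 4 / 1e6  # millions of parameters

distil_size_mb = 260               # assumed: ~65M params * 4 bytes
full_size_mb = 2 * distil_size_mb  # "weights roughly twice as large"

print(params_from_fp32_size(distil_size_mb))  # ~65M
print(params_from_fp32_size(full_size_mb))    # ~130M, ~26x a 5M-param model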
Also @dbl001, I recommend using Karpathy's llama2.c: it's practically the same as nanoGPT, but it's based on the more modern Llama architecture and integrates better with the current ecosystem if you want people to use your research. It has better inference, quants, and HF conversions.
You can also train a custom tokenizer for llama2.c, one more suited to your data.
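A minimal sketch of training one with sentencepiece (paths and vocab_size are placeholders, and I'm assuming llama2.c's custom-vocab path is sentencepiece-based, so the resulting .model file should slot in):

# Hypothetical: train a domain-specific BPE tokenizer with sentencepiece.
# "covid_papers.txt" and vocab_size are placeholders; adjust to your corpus.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="covid_papers.txt",   # plain text, one document or paragraph per line
    model_prefix="covid_bpe",   # writes covid_bpe.model / covid_bpe.vocab
    model_type="bpe",
    vocab_size=8000,            # smaller vocabs often suit small models better
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="covid_bpe.model")
print(sp.encode("The spike protein mediates viral entry", out_type=str))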
I trained a model of the same size with llama2.c on the TinyStories dataset and got a loss of ~2.
Thanks! I'll take a look.
Ah, I see. The dataset should be fine, and the Llama 2 tokenizer should work, but you would need to change the dataloader to tokenize the PDFs.
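Something like this sketch is what I mean for the pretokenization step (assumptions: pypdf for text extraction, the Llama 2 sentencepiece tokenizer.model on disk, and a flat .bin file of uint16 token ids, which is the general shape llama2.c's loader works with):

# Sketch of a PDF pretokenizer; paths are placeholders.
import glob
import numpy as np
import sentencepiece as spm
from pypdf import PdfReader

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # Llama 2 sentencepiece model

all_tokens = []
for path in glob.glob("papers/*.pdf"):
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    all_tokens.extend([sp.bos_id()] + sp.encode(text))  # prepend BOS per document

# vocab_size 32000 fits in uint16
np.array(all_tokens, dtype=np.uint16).tofile("data/covid_papers.bin")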
llama2.c hasn't done much better than nanoGPT:
./run out/model.bin -i "The benefits and the disadvantages of technology-mediated teaching and learning "
The benefits and the disadvantages of technology-mediated teaching and learning 1 (Sioned and hypoth planned production study) in these logic, only food nanolure attendance emerging movement one. Nevertheless, lai mortality technological capsends is presented in ax tilergism (GLA) confirmed the cargravined samples of patients with Dspough, reliable and absolute virtual occurrence of input, and pres objects in privateize those trends to have information supporting a telenecology. Using a more sample will be made available with disuity sharing, and this is not worldwide. Methods: The most likelylications for repeated inpi vertical people with lower into, related to admission may be used to ensure inter-existing number of most of these clinical could be an minimally considered long. However, the department diversity of concerns' following a week it has not been suggested due to the initial injury of non-incing ad appear to be from a major severe cause. We have had any received history of thrombosis related to spreading and a apparent immune system undergoing cancer progress. Despite these employees, a sor et al. (2020) a successful load defined by
achieved tok/s: 102.947113
Model parameters:
out_dir = "out"
eval_interval = 100
log_interval = 10
eval_iters = 100
eval_only = False # if True, script exits right after the first eval
always_save_checkpoint = True # if True, always save a checkpoint after each eval
init_from = "scratch" # 'scratch' or 'resume'
# wandb logging
wandb_log = True # W&B logging enabled for this run
wandb_project = "llamac"
wandb_run_name = "run" + datetime.now().strftime("%Y_%m_%d_%H_%M_%S")  # requires: from datetime import datetime
# data
batch_size = 8 # if gradient_accumulation_steps > 1, this is the micro-batch size
max_seq_len = 1024
vocab_source = "llama2" # llama2|custom; use the Llama 2 vocab from Meta, or a custom trained one
vocab_size = 32000 # the Llama 2 tokenizer has 32K tokens
# model
dim = 288
n_layers = 16
n_heads = 8
n_kv_heads = 8
multiple_of = 32
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 4 # used to simulate larger batch sizes
learning_rate = 5e-5 # max learning rate
max_iters = 5000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 500 # how many steps to warm up for
# system
device = "mps" # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = "bfloat16" # float32|bfloat16|float16
compile = False # use PyTorch 2.0 to compile the model to be faster
The loss plateaued at ~4 after ~2,800 iterations:
/Users/davidlaxer/anaconda3/envs/AI-Feynman/bin/python /Users/davidlaxer/llama2.c/train.py
tokens per iteration will be: 32,768
breaks down as: 4 grad accum steps * 1 processes * 8 batch size * 1024 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 113, with 25,141,248 parameters
num non-decayed parameter tensors: 33, with 9,504 parameters
using fused AdamW: False
...
step 2800: train loss 4.4106, val loss 4.3582
saving checkpoint to out
wrote out/model.bin
2800 | loss 4.3785 | lr 2.412751e-05 | 81919.56ms | mfu 0.33%
2810 | loss 4.1788 | lr 2.395311e-05 | 7445.09ms | mfu 0.33%
2820 | loss nan | lr 2.377876e-05 | 153206.26ms | mfu 0.30%
2830 | loss nan | lr 2.360446e-05 | 6208.07ms | mfu 0.30%
2840 | loss nan | lr 2.343024e-05 | 21007.32ms | mfu 0.28%
Learning peaked with a loss of ~4 (before the loss went to NaN).
Hmm, 25M params, but also a max sequence length of 1024; try lowering it to 512? Also, a +1 improvement in loss is pretty good at that scale, but
Also, if you're using an A100, why is device = "mps" and torch.compile off? Also try increasing the batch size and grad accum; training runs tend to go for ~0.5M tokens a step.
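For reference, here is how your current config works out, and roughly what it would take to reach ~0.5M tokens a step (the target is just a rule of thumb, and the batch sizes below are hypothetical):

# Tokens processed per optimizer step = grad_accum * batch_size * max_seq_len.
grad_accum, batch_size, max_seq_len = 4, 8, 1024
print(grad_accum * batch_size * max_seq_len)  # 32,768 -- matches your log

# To hit ~0.5M tokens/step at max_seq_len=512:
target = 500_000
print(target // (8 * 512))    # ~122 grad-accum steps at batch_size=8
print(target // (64 * 512))   # ~15 grad-accum steps at batch_size=64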
I have tried training on 'mps' on an AMD Radeon Pro 5700 XT and on an A100 on Google Colab Pro. With n_layers=16 and n_heads=8, the loss leveled off at ~5 and the results were nonsensical. So I tried {'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-05, 'vocab_size': -1} on an RTX A6000, which runs out of memory:
Traceback (most recent call last):
File "/content/llama2.c/train.py", line 312, in <module>
scaler.scale(loss).backward()
File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 52.56 MiB is free. Process 263789 has 39.51 GiB memory in use. Of the allocated memory 37.80 GiB is allocated by PyTorch, and 336.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
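In case it helps anyone else hitting this, the allocator hint from the error message can be set like this (I haven't verified it helps here, and presumably it won't if the model simply doesn't fit):

# Allocator hint from the OOM message; set it before any CUDA memory is allocated,
# e.g. at the very top of train.py.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported after setting the env var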
Next, I'll try an H100. On 'mps', compile=True fails, and compile=True also failed on the V100 and A100 on Colab.
{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-05, 'vocab_size': -1} - that seems big for an A6000
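Rough math on why it's big (a sketch: I'm assuming the standard Llama-7B FFN width and vocab_size=32000, and ignoring activations and mixed-precision details):

# Approximate Llama-style parameter count for dim=4096, n_layers=32, vocab=32000.
dim, n_layers, vocab = 4096, 32, 32000
hidden = 11008  # assumed Llama-7B FFN width (~2/3 * 4 * dim, rounded up to multiple_of)

per_layer = 4 * dim * dim + 3 * dim * hidden    # attention (wq, wk, wv, wo) + SwiGLU FFN
total = n_layers * per_layer + 2 * vocab * dim  # plus token embeddings and output head
print(f"{total / 1e9:.1f}B params")             # ~6.7B, i.e. Llama-2-7B territory

# Full-precision AdamW keeps ~16 bytes/param (weights + grads + two moments):
print(f"{total * 16 / 1e9:.0f} GB")             # ~100+ GB, far more than a 40-48 GB card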
"On 'mps' compile=True fails and compile=True failed on the V100 and A100 on Colab" - that's interesting; can you check on a T4? Raise an issue on that.
I am training nanoGPT on a dataset of ~800,000 COVID-19 research papers on an A100 GPU.
I can't get the loss to go any lower than ~5. The generated output looks like a COVID-19 research paper but is nonsensical, contains duplicate adjacent phrases, etc.
Here's an example input line:
Here's an example of generated output for a prompt (e.g. "The HIV-1 genomic RNA (gRNA) has three major functions..."):
The loss stopped decreasing at ~5,000 iterations, but I continued until ~7,300. Here are my parameters from train.py:
Questions: Is the input too complex for the model? SciBERT and BioBERT can handle scientific papers. Should I try a different tokenizer (other than tiktoken)? Should I try a different optimizer (other than AdamW)?
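For context, this is roughly how I'd compare tokenizer fertility on the domain text (tiktoken's "gpt2" encoding is what nanoGPT's prepare scripts use, as far as I know; a trained domain tokenizer would be the thing to compare against):

# Quick check of tokens-per-word on a sample sentence from the corpus.
import tiktoken

text = "The HIV-1 genomic RNA (gRNA) has three major functions"
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode(text)
print(len(ids), len(ids) / len(text.split()))  # token count and tokens-per-word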