karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.
MIT License

Training nanoGPT on COVID-19 Dataset #391

Open · dbl001 opened this issue 7 months ago

dbl001 commented 7 months ago

I am training nanoGPT on a dataset with ~800,000 COVID-19 research papers on an A100 GPU.

--n_layer=48 --n_head=8 --n_embd=64 --device='cuda' --compile=False --eval_iters=1 --block_size=2048 --batch_size=16 --max_iters=10000

I can't get the loss to go any lower than ~5. The generated output looks like a COVID-19 research paper but is nonsensical, contains duplicate adjacent phrases, etc.

Here's an example input line:

The benefits and the disadvantages of technology-mediated teaching and learning became a focal point for university research in the context of the COVID-19 crisis (Kamarianos et al., 2020; Karalis and Raikou, 2020; Owusu-Fordjour et al., 2020; Shah et al., 2020) . However, this topic is not new but one of the central research focuses in the context of learning in digital learning environments. Davis and Wong (2007) define e-learning as a global phenomenon for organizations and educational institutions, aiming to enhance students' learning experience and effectiveness in terms of the learning outcome. The benefits of e-learning have been discussed in recent research, but so far, there is no consensus on whether the outputs of e-learning are more effective than those of traditional learning formats (Derouin et al., 2005) . The most frequently stated benefits are cost efficiency, flexibility (in terms of time and place), saving time to travel to the learning location, easy access to learning materials, as well as the usefulness of learning materials for a longer period (Welsh et al., 2003; Brown et al., 2006; Hameed et al., 2008; Jefferson and Arnold, 2009; Hill and Wouters, 2010; Al-Qahtani and Higgins, 2013; Becker et al., 2013) , or the potential to offer personalized learning according to the learner's specific needs (Berge and Giles, 2006) . On the negative side, technology-mediated learning lacks direct social interaction and a personal touch and has the potential to socially isolate the learner or at least to negatively influence social aspects of learning processes (Gimson and Bell, 2007; Hameed et al., 2008; Al-Qahtani and Higgins, 2013; Becker et al., 2013) . Socially isolated learning can negatively influence the development of learners' communication skills, as well as change communication conditions, including the lack of support and feedback using non-verbal cues or by observing the interactions of others, as well as the lack of social and cognitive presence and teacher's involvement (Al-Qahtani and Higgins, 2013) . Furthermore, learners are insecure about their learning in the absence of regular contact to the teachers (Al-Qahtani and Higgins, 2013). Technology-mediated teaching and learning requires self-motivation, time management and a focused approach and self-directed learning and organization skills of learners (Hameed et al., 2008; Jefferson and Arnold, 2009 ). According to Al-Qahtani and Higgins (2013), these requirements arise partly from the conditions of social isolation and lack of direct social interaction, which means that the learner must have a relatively strong motivation to mitigate this effect. During the lockdown of the universities the expectation was that most of the young students will not have any difficulty in switching to online teaching, which is indeed confirmed by actual findings (e.g., Kamarianos et al., 2020) . Shah et al. (2020) point out the numerous and immediately apparent benefits of transferring learning to the virtual world: free exchange of information, access to lectures and presentations at conferences that used to involve considerable travel costs, webinars and online discussions, reduction of time inefficiency associated with travel and increased commitment. Owusu-Fordjour et al. (2020) identify negative effects, e.g., learning at home can be ineffective because of many distractions, no adequate learning environment, or contact with the teacher. 
Less problems have been found in switching to online teaching, however, on the negative side, technical obstacles as well as lack of communication and cooperation, difficulties to concentrate, too many screen-time, lack of logistical infrastructure, non-physical presence, more workload and the loss of lab courses and the general restriction of social contact have been pointed out as important during the crisis. To the positive characteristics belong the easy participation in class, time savings, home comfort, the possibility to learn, new competences, attending and learning flexibility.

Here's an example of generated output with the prompt "The HIV-1 genomic RNA (gRNA) has three major functions...":

Overriding: out_dir = out
Overriding: device = mps
Overriding: compile = False
Overriding: start = The HIV-1 genomic RNA (gRNA) has three major functions
Overriding: num_samples = 5
number of parameters: 4.01M
No meta.pkl found, assuming GPT-2 encodings...
The HIV-1 genomic RNA (gRNA) has three major functions of the most of the infection in the work of the presence of two of the the of the that a studies of the effect of the study in the treatment of the-19 (10. The is a COVID-R-2 , and the need in-19 in the care of the all the same [see38 and P: the significant is the role in the the risk are 1. The need of-V 1.g.11] . This studies and (9 analysis of the high-or, were on the 3. 2 and is also been a study of the COVID-2 (3 days of the each the use of the other clinical and by RF is the specific results.R). . The results was are a significant in the than that compared to understand the patients. In the study.19.e.5. The different set of the immune of COVID-1.e. The study risk of the first expression, such as a different, and the an number of the model were not been a model. These is are that the first, or the results.B and the social population and a significant (p. Since the effect, and the most of the study, and an sub-19. The virus and the "3] , and-19 and the primary to the and the first in the for the is used by a- . The expression for the response to have.e, and as SARS-2 was identified a four virus, the presence for social for the as a pre-19-fFig. We also have-2. The virus, the treatment of the study.6] . To we have they has also have a an C-19, and the SARS-C, they is high-CoV-19, and inter-Co , the immun activity.5 is the same data, R was a potential [2] . As the COVID-19 analysis.5 and the the same of the levels of the context of these level of two-nPC. The viral effect of control of the number of the low impact of the three-CoVID-and 1.4 or in the need, social of this (1 (associated types of the medical, and the data have been been been data, the two years of the a negative of the result of the study.5 groups, and other for COVID-12) . This and a pandemic. During the human-
---------------
The HIV-1 genomic RNA (gRNA) has three major functions in the those in this study. TheVID-19 and the studies to Sin [19, and its countries, or the global immun-19, which that the current than the study of the disease models of a development of COVID-B, the virus in the second, such as the relationship in the all the low patients with a greater in the end of the human-or, and the recent major COVID--1.5.35, , with the at the large study, the human/1. The model of the difference of the role, and the lower pandemic, and the major. Several study of three results (8. The (2, could been noted in the information, the COVID-2ase; 2020) associated to our human. We reported in the disease, which-T. We is more two-20] . The research group. The study of significant pand-19 (22. In the fact. The number of the value of the study of the case of the development of the difference by a a time of the two of the E/4) and the a not not when P2-19. The be a the world, our analysis of the different cell-19, which and the population of the model' and a first to an the authors or the data. The study that the potential in the results in the most.5) [5) . The than in the is an in the information into the spread in the pand In the health level, for the and H18] .6, its a time to a current. This in the pand in the previous study, (1) of the "31) . In the or the study on the context of the than the S In the second, the or data were can be a al can ") . In the use of the results by the higher levels of the immune: patients and a a high-E) were are also is the studies of the use of pandemic within clinical and the GThe the development of the effective to the clinical number of the of the C-19.5, as such with the fact of the number of the pandemic and a higher--19 and they. In the use at the Rone-C) of the findings has 10 in the immune-based cases is that of the study can be is the levels of the 1 (19 pand to the 0. The non-R% of the one of

The loss stopped decreasing at ~5,000 iterations but I continued until ~7,300. Here are my parameters from train.py:

gradient_accumulation_steps = 5 # used to simulate larger batch sizes
batch_size = 12 # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 1024
# model
n_layer = 48
n_head = 8
n_embd = 768
dropout = 0.0 # for pretraining 0 is good, for finetuning try 0.1+
bias = False # do we use bias inside LayerNorm and Linear layers?
# adamw optimizer
learning_rate = 1e-5 # max learning rate
max_iters = 10000 # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.99
grad_clip = 1.0 # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True # whether to decay the learning rate
warmup_iters = 1000 # how many steps to warm up for
lr_decay_iters = 3000 # should be ~= max_iters per Chinchilla
min_lr = 5e-5 # minimum learning rate, should be ~= learning_rate/10 per Chinchilla
# DDP settings
backend = 'nccl' # 'nccl', 'gloo', etc.
# system
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'bfloat16' # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = True # use PyTorch 2.0 to compile the model to be faster

Here's the train.py command line with its parameter overrides:

!python train.py --init_from='resume'  --dataset=covid --n_layer=48 --n_head=8 --n_embd=64 --device='cuda' --compile=False --eval_iters=1 --block_size=2048 --batch_size=16 --max_iters=10000

Overriding: init_from = resume
Overriding: dataset = covid
Overriding: n_layer = 48
Overriding: n_head = 8
Overriding: n_embd = 64
Overriding: device = cuda
Overriding: compile = False
Overriding: eval_iters = 1
Overriding: block_size = 2048
Overriding: batch_size = 16
Overriding: max_iters = 10000
Resuming training from out
number of parameters: 5.58M
num decayed parameter tensors: 194, with 5,709,824 parameters
num non-decayed parameter tensors: 97, with 6,208 parameters
using fused AdamW: True
...
iter 7369: loss 5.0866, time 6824.75ms, mfu 6.72%
iter 7370: loss 5.3131, time 6811.68ms, mfu 6.72%
iter 7371: loss 5.0493, time 6812.17ms, mfu 6.72%

wandb: 
wandb: Run history:
wandb:       iter ▁▂▃▅▆▇█
wandb:         lr ▁▂▃▄▆▇█
wandb:        mfu ▁██████
wandb: train/loss ██▁▆▆█▅
wandb:   val/loss ▆▆▃█▁▆█
wandb: 
wandb: Run summary:
wandb:       iter 7300
wandb:         lr 1e-05
wandb:        mfu 6.71817
wandb: train/loss 5.09996
wandb:   val/loss 5.23053
[Screenshots (3), 2023-11-18]
  1. Adding dropout made things worse.
  2. Smaller learning_rate didn't help.
  3. I tried various values for n_layer, n_head, batch_size and block_size. The current parameters got the loss down to ~5.

Questions: Is the input too complex for the model? SciBERT and BioBERT can handle scientific papers. Should I try a different tokenizer (other than tiktoken)? Should I try a different optimizer (other than AdamW)?
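As a quick sanity check on the tokenizer question, here is how the GPT-2 BPE that sample.py falls back to (the "No meta.pkl found, assuming GPT-2 encodings" path above) splits biomedical terms. This is just a sketch; the example terms are arbitrary:

import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the encoding the sampling script falls back to without meta.pkl
for term in ["hydroxychloroquine", "seroprevalence", "SARS-CoV-2"]:
    ids = enc.encode(term)
    # domain-specific terms tend to shatter into several generic sub-word pieces
    print(f"{term!r} -> {len(ids)} tokens: {[enc.decode([i]) for i in ids]}")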

VatsaDev commented 7 months ago

Dude, you are on a single A100; you need more than that to scale. You've looked at every parameter other than model size.

Also, do you have a link to the dataset? It might need filtering for metadata removal, and formatting.

A 5-million-parameter model is below toy level. I'm surprised you even got a loss of 5 on scientific papers.
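For context, here is a rough back-of-the-envelope count for the config in your log (a sketch, assuming nanoGPT's default padded GPT-2 vocab of 50,304 and bias=False). It reproduces the 5,709,824 decayed parameters reported above, and more than half of them are just the token embedding table, because n_embd=64 is so small:

vocab, block, n_layer, n_embd = 50_304, 2048, 48, 64

wte = vocab * n_embd                              # token embeddings: 3,219,456
wpe = block * n_embd                              # position embeddings: 131,072
attn = 3 * n_embd * n_embd + n_embd * n_embd      # qkv projection + output projection, per layer
mlp = 4 * n_embd * n_embd + 4 * n_embd * n_embd   # MLP up-projection + down-projection, per layer
total = wte + wpe + n_layer * (attn + mlp)

print(f"{total:,}")          # 5,709,824 -- matches the decayed-parameter count in the log
print(f"{wte / total:.2f}")  # 0.56 -- the embedding table dominates the model

So the transformer blocks themselves hold only ~2.4M parameters.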

Let's look at the examples you've cited. For BioBERT I couldn't find concrete information on parameter count, but its distilled version, CompactBioBERT (https://huggingface.co/nlpie/compact-biobert), is a 65M model, 13x bigger than yours. Comparing the sizes of the model weight files (both are from the fp32 era), the full model is about twice as large as the distilled one, so roughly a ~130M model, 26x bigger than your current setup.

Also, @dbl001, I recommend using Karpathy's llama2.c. It's practically the same as nanoGPT but based on the more modern Llama architecture, and it integrates better with the current ecosystem if you want people to use your research: it has better inference, quantization, and HF conversion support.

You can also train a custom tokenizer with llama2.c, one better suited to your data.
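llama2.c's tokenizer is sentencepiece-based, so you can train a domain vocab with sentencepiece directly. A minimal sketch (covid_corpus.txt and the vocab size are placeholders, not anything from your setup):

import sentencepiece as spm

# covid_corpus.txt is a placeholder: plain text extracted from the papers, one document per line
spm.SentencePieceTrainer.train(
    input="covid_corpus.txt",
    model_prefix="covid_bpe",
    model_type="bpe",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="covid_bpe.model")
print(sp.encode("The HIV-1 genomic RNA (gRNA) has three major functions", out_type=str))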

I trained a model of the same size with llama2.c on the TinyStories dataset and got a loss of ~2.

dbl001 commented 7 months ago

Thanks! I'll take a look.

https://github.com/allenai/cord19

VatsaDev commented 7 months ago

Ah, I see. The dataset should be fine, and the Llama 2 tokenizer should work, but you would need to change the dataloader to tokenize the PDFs.
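As far as I can tell, the training loaders in both repos just memory-map a flat file of uint16 token ids, so "changing the dataloader" mostly means pre-tokenizing the extracted text into that format. A rough sketch (the paths and the sentencepiece model are placeholders from the sketch above):

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="covid_bpe.model")  # placeholder tokenizer

ids = []
with open("covid_corpus.txt") as f:   # placeholder: one document per line
    for line in f:
        ids.extend(sp.encode(line))
        # optionally append an EOS / document-separator id here

np.array(ids, dtype=np.uint16).tofile("train.bin")  # uint16 is fine as long as vocab_size < 65,536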

dbl001 commented 7 months ago

llama2.c hasn't done much better than nanoGPT:

 ./run out/model.bin -i "The benefits and the disadvantages of technology-mediated teaching and learning "
The benefits and the disadvantages of technology-mediated teaching and learning 1 (Sioned and hypoth planned production study) in these logic, only food nanolure attendance emerging movement one. Nevertheless, lai mortality technological capsends is presented in ax tilergism (GLA) confirmed the cargravined samples of patients with Dspough, reliable and absolute virtual occurrence of input, and pres objects in privateize those trends to have information supporting a telenecology. Using a more sample will be made available with disuity sharing, and this is not worldwide. Methods: The most likelylications for repeated inpi vertical people with lower into, related to admission may be used to ensure inter-existing number of most of these clinical could be an minimally considered long. However, the department diversity of concerns' following a week it has not been suggested due to the initial injury of non-incing ad appear to be from a major severe cause. We have had any received history of thrombosis related to spreading and a apparent immune system undergoing cancer progress. Despite these employees, a sor et al. (2020) a successful load defined by
achieved tok/s: 102.947113

Model parameters:

out_dir = "out"
eval_interval = 100
log_interval = 10
eval_iters = 100
eval_only = False  # if True, script exits right after the first eval
always_save_checkpoint = True  # if True, always save a checkpoint after each eval
init_from = "scratch"  # 'scratch' or 'resume'
# wandb logging
wandb_log = True      # enabled for this run (disabled by default)
wandb_project = "llamac"
wandb_run_name = "run" + datetime.now().strftime("%Y_%m_%d_%H_%M_%S")
# data
batch_size = 8  # if gradient_accumulation_steps > 1, this is the micro-batch size
max_seq_len = 1024
vocab_source = "llama2" # llama2|custom; use Llama 2 vocab from Meta, or custom trained
vocab_size = 32000 # the Llama 2 tokenizer has 32K tokens
# model
dim = 288
n_layers = 16
n_heads = 8
n_kv_heads = 8
multiple_of = 32
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 4  # used to simulate larger batch sizes
learning_rate = 5e-5  # max learning rate
max_iters = 5000  # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 500  # how many steps to warm up for
# system
device = "mps"  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = "bfloat16"  # float32|bfloat16|float16
compile = False  # use PyTorch 2.0 to compile the model to be faster

The loss plateaued at ~4 after ~2,800 iterations:

/Users/davidlaxer/anaconda3/envs/AI-Feynman/bin/python /Users/davidlaxer/llama2.c/train.py 
tokens per iteration will be: 32,768
breaks down as: 4 grad accum steps * 1 processes * 8 batch size * 1024 max seq len
Initializing a new model from scratch
num decayed parameter tensors: 113, with 25,141,248 parameters
num non-decayed parameter tensors: 33, with 9,504 parameters
using fused AdamW: False
...
step 2800: train loss 4.4106, val loss 4.3582
saving checkpoint to out
wrote out/model.bin
2800 | loss 4.3785 | lr 2.412751e-05 | 81919.56ms | mfu 0.33%
2810 | loss 4.1788 | lr 2.395311e-05 | 7445.09ms | mfu 0.33%
2820 | loss nan | lr 2.377876e-05 | 153206.26ms | mfu 0.30%
2830 | loss nan | lr 2.360446e-05 | 6208.07ms | mfu 0.30%
2840 | loss nan | lr 2.343024e-05 | 21007.32ms | mfu 0.28%

The loss plateaued at ~4:

[Screenshot, 2023-11-26]
VatsaDev commented 7 months ago

Hmm, 25M params, but also a max sequence length of 1024; try lowering it to 512? Also, a +1 loss is pretty good at that scale.

Also, if you're using an A100, why is device = mps and torch.compile off? Try increasing the batch size and grad accum; training runs tend to go for ~0.5M tokens a step.
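The tokens-per-step arithmetic, for reference (the 0.5M figure is a rough rule of thumb, and the second line is just one example combination):

# tokens per optimizer step = grad_accum_steps * batch_size * max_seq_len
print(4 * 8 * 1024)    # 32,768  -- your current llama2.c settings, matching the log above
print(64 * 8 * 1024)   # 524,288 -- one way to get into the ~0.5M tokens/step ballpark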

dbl001 commented 7 months ago

I have tried training on 'mps' on an AMD Radeon Pro 5700 XT and on an A100 on Google Colab Pro. With n_layers=16 and n_heads=8, the loss leveled off at ~5 and the results were nonsensical. So I tried {'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-05, 'vocab_size': -1} on an RTX A6000, which runs out of memory:

Traceback (most recent call last):
  File "/content/llama2.c/train.py", line 312, in <module>
    scaler.scale(loss).backward()
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 52.56 MiB is free. Process 263789 has 39.51 GiB memory in use. Of the allocated memory 37.80 GiB is allocated by PyTorch, and 336.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Next, I'll try an H100. On 'mps', compile=True fails, and it also failed on the V100 and A100 on Colab.

VatsaDev commented 7 months ago

{'dim': 4096, 'multiple_of': 256, 'n_heads': 32, 'n_layers': 32, 'norm_eps': 1e-05, 'vocab_size': -1} - that seems big for an A6000
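Rough math on why that config OOMs (a sketch, assuming llama2.c's tied input/output embeddings and AdamW's two fp32 moment buffers per parameter):

import math

dim, n_layers, vocab, multiple_of = 4096, 32, 32000, 256

# llama2.c-style SwiGLU hidden size: 4*dim, scaled by 2/3, rounded up to a multiple of multiple_of
hidden = multiple_of * math.ceil((2 * 4 * dim / 3) / multiple_of)   # 11008

attn = 4 * dim * dim          # wq, wk, wv, wo
ffn = 3 * dim * hidden        # w1, w2, w3
emb = vocab * dim             # assumed tied with the output head
params = n_layers * (attn + ffn) + emb
print(f"{params / 1e9:.1f}B parameters")   # ~6.6B -- essentially Llama-2-7B sized

# bf16 weights (2 B) + bf16 grads (2 B) + fp32 Adam moments (8 B) per parameter
print(f"{params * 12 / 1e9:.0f} GB before activations")   # ~79 GB, far over a 40-48 GB card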

On 'mps' compile=True fails and compile=True failed on the V100 and A100 on Colab. - that's interesting; can you check on a T4? Raise an issue on that.

dbl001 commented 7 months ago

https://github.com/pytorch/pytorch/issues/113521