karpathy / nanoGPT

The simplest, fastest repository for training/finetuning medium-sized GPTs.

NaNs when training with 'MPS' #217

Open dbl001 opened 1 year ago

dbl001 commented 1 year ago

I trained nanoGPT for 200,000 iterations on a ~5 GB dataset of COVID-19 research papers (from here):

https://allenai.org/data/cord-19

I am running 'MPS' on a 2021 Intel iMac 27" with an AMD Radeon Pro 5700 XT GPU.

With device='cpu', training completed with no NaNs. With device='mps', the logits and loss would become NaN after some number of iterations (typically a few thousand), with fused=False in AdamW:

 optimizer = torch.optim.AdamW(optim_groups, fused=False, lr=learning_rate, betas=betas, **extra_args)
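
For context, here is a sketch of how upstream nanoGPT's configure_optimizers decides whether to pass fused=True (variable names assumed from the model.py of this era; check your checkout). The fused kernel is gated on CUDA, so fused=False is the expected path on MPS in any case:

 import inspect
 # fused AdamW only exists in newer PyTorch builds, and nanoGPT
 # enables it only on CUDA devices
 fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
 use_fused = fused_available and device_type == 'cuda'
 extra_args = dict(fused=True) if use_fused else dict()
 optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)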

I tracked the NaNs to the LayerNorm in the Block class:

self.ln_1 = LayerNorm(config.n_embd, bias=config.bias)

If the tensor going through LayerNorm contained any -Inf, subsequent computations produced NaNs when applying block(x) while iterating through the ModuleList in the GPT class. This led to NaNs in the logits and loss.
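
This propagation is easy to reproduce in isolation: a single -Inf poisons an entire row, because the row mean becomes -Inf and (-Inf) - (-Inf) is NaN. A minimal standalone sketch (not from the repo):

 import torch

 ln = torch.nn.LayerNorm(4)
 x = torch.randn(2, 4)
 x[0, 0] = float("-inf")   # one bad element in row 0

 with torch.no_grad():
     out = ln(x)
 print(out[0])  # all NaN -- the entire row is poisoned
 print(out[1])  # finite -- other rows are unaffected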

Curiously, when I saved the tensors containing -Inf or NaN to a file, training was able to complete 200,000 iterations without NaNs in the logits and loss. Adding the tests for -Inf and NaN slowed down the per-iteration training time.

Here's 200,000 iterations with device='mps' while saving the errant tensors:

iter 199999: loss 6.3397, time 324.43ms, mfu 0.01%
step 200000: train loss 6.4941, val loss 5.8301
saving checkpoint to out
iter 200000: loss 5.6057, time 496.72ms, mfu 0.01%

Here is where I saved the tensors in model.py:

        i = 0
        for block in self.transformer.h:
            x = block(x)

            # Save any block output that contains NaN
            nan_mask = torch.isnan(x)
            if nan_mask.any():
                print("x = block(x):  ", block, x)
                fileName = "/Users/davidlaxer/nanoGPT/out/x_%d_nan.pt" % (i)
                torch.save(x, fileName)
                i = i + 1

            # Save any block output that contains +/-Inf
            inf_mask = torch.isinf(x)
            if inf_mask.any():
                print("x = block(x):  ", block, x)
                fileName = "/Users/davidlaxer/nanoGPT/out/x_%d_inf.pt" % (i)
                torch.save(x, fileName)
                i = i + 1
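
A hypothetical lighter-weight variant (my sketch, not code from the repo) would sanitize the activations in place instead of writing them to disk; torch.nan_to_num replaces NaN and ±Inf with finite values:

        # Hypothetical alternative: clamp non-finite activations instead of
        # saving them (nan -> 0.0, +inf -> 1e4, -inf -> -1e4)
        for block in self.transformer.h:
            x = block(x)
            if not torch.isfinite(x).all():
                x = torch.nan_to_num(x, nan=0.0, posinf=1e4, neginf=-1e4)

Whether this merely masks the underlying MPS kernel bug is an open question, but it avoids the disk I/O of the saving approach.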

Here are the saved tensors (probably from around iteration 100,000). After saving, training continued without NaNs in logits or loss.

% ls -lrt out/x*.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_0_nan.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_1_inf.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_2_nan.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_3_inf.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_4_nan.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_5_inf.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_6_nan.pt
-rw-r--r--  1 davidlaxer  staff  66283 Mar 22 09:04 out/x_7_inf.pt

In [4]: torch.load("out/x_2_nan.pt")
Out[4]: 
tensor([[[        nan,         nan,         nan,  ...,         nan,
                  nan,         nan],
         [-8.5923e-02, -1.5751e-01, -4.4399e-02,  ...,        -inf,
                 -inf,        -inf],
         [        nan,         nan,         nan,  ...,         nan,
                  nan,         nan],
         ...,
         [        nan,         nan,         nan,  ...,         nan,
                  nan,         nan],
         [        nan,         nan,         nan,  ...,         nan,
                  nan,         nan],
         [        nan,         nan,         nan,  ...,         nan,
                  nan,         nan]],

        [[-7.0925e-02, -4.2812e-02, -7.7797e-02,  ...,  3.4136e-02,
          -8.8976e-02, -3.3967e-02],
         [        nan,         nan,         nan,  ...,         nan,
                  nan,         nan],
         [ 4.7617e-03,  7.5204e-02, -5.4484e-04,  ...,  3.0971e-02,
          -2.6159e-03, -7.7905e-02],
         ...,
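
A quick way to quantify how much of a saved tensor is corrupted (a follow-up sketch in the same IPython session; the actual fractions will vary):

In [5]: x = torch.load("out/x_2_nan.pt")

In [6]: torch.isnan(x).float().mean().item()   # fraction of NaN entries

In [7]: torch.isinf(x).float().mean().item()   # fraction of +/-Inf entries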

The output from 'sample.py' doesn't look great. E.g.:

% RUST_BACKTRACE=full  python sample.py --out_dir=out --device='cpu' --compile=False

Overriding: out_dir = out
Overriding: device = cpu
Overriding: compile = False
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...

The pandemic, the main measures were have also likely to we may have been used to their clinical studies who have performed an effective in SARS-CoV-2 . . These study has a high-2-2 was used in the disease, and the most of the "2 in a cell-tacal, compared to the S1 and no positive effect of the current study. The COVID-19 can have been in the SARS-CoV-2. The results of the results of the 3. The authors had at the COVID-19 pandemic. The other other studies are "in such as the present study we use of the other-related pandemic [ 0. The COVID-19 disease in the H11-6/ inch and DP and 3. These would be noted that the case of the SARS-CoV-2. This had a major sub-1 is for pre-based data, (P) and there is the number of a single-andbid virus and in the first test. For three T-level other pathization's more than the following hospital period in the first patients with P2, and/2 are shown in a virus of the use of the same other study of the risk of symptoms. In addition, the COVID-19 pandemic, the most effective studies are not be to the first increase of the following hand, and all three other regions of patients and the data, the role in these results may have been used to be more than to not provide only available-ir and use of the number of the end of the cases. There is been possible by a general analysis that the B-orin was of a new treatment of the protein in the pathogen-orbid health.
The study-based study of B = 0.1 and p than the study on the A was been related to be also shown to the number of the COVID-19 pandemic is not be), and the patient-19, in the; one of a risk of the G ) which is considered a complex. The significant respiratory. These data are not from the COVID-19 pandemic (17%) under a two group of the high-and al. (Fig. 1B) . . The results of the findings were, which is the second-tatumation and an increased more than on this study-based, the high-term virus was well
---------------

The study, including the results in p = 0.a) had a high sub-I (F.A) and were not be common to a more likely than the P −1-p cells (D), has been well-specific [1, 4 In the first, the most of the other and 1-10 after the population of the study with the study, H0.: The SARS-CoV-2% that were from the pandemic is a significant study, a higher than health as in the pandemic. In the patient study, we also important to the non-based with the first two-ing (20x7) . In the virus, the other results of the high-specific patients is not have a higher positiveV-F-2-in or protein in general health care has been) with well-L (Rx =8. 1 ). The
The a general preprint in B were compared to the most important by COVID-19 as a different T infection of the main level of a more than an increase in the path] . The total of the pandemic and the pandemicemic is observed in the general study, the public health of the end of COVID-19 pandemic, we suggest that the first health activity of the COVID-19 pandemic, the role of the same study, including D-, Pd, and the end of V1, with cell-6-21, and p- 0.4 h1, the M41 is very a low type of B were significantly higher in the SARS-CoV-2. To patients with this study, the risk of SARS-CoV as such as a losos e-2-2 cases, 1, and 1, in the potential analysis that our study is a new role in different activity of CD5 to the health and the N-1, the T-2 is increased in patients. The major in a higher in the SARS-CoV-2 years (Fig. 2C). The model is no significant significant significant from the SARS-CoV; T-B and C- and s1-2 cells was more al., in E than the respiratory H1 and SARS-CoV-2-2 to SARS-2 were shown in a number of E-CoV-T2. This study is associated with the pre-3-C for
---------------

 majority of the development of the other infection. The number of the "P-or an RNA activity of the potential for the respiratory care with their model.
TheThe most shown in its increase in the results of SARS-CoVCoV-2. we are associated with SARS-CoV-2 T-2 infection was also be more than the study. In the SARS-CoV-2-2-2 in all the pre-2 protein can be demonstrated that it can be more number of COVID-19, and SARS-CoV-2 was performed, for different al. For the early health, the studies are used to be observed in a social risk of COVID-19. In the lack of the pandemic [13] . The number of SARS-CoV-CoV-2 was not also an on the vaccine cell of the results of the one of the vaccine's no lower than that the infection of the clinical research of the sample. The results from the studies are not only in the two-fferan in SARS-CoV-2 disease, and are the cases of T-s. Therefore, a T cells are not be in the pandemic. The disease was to 3.2. As a cell cell of the mean is a limited. For the non-I of the two R cells, the analysis, the total of more likely to the first Constitutionalorbid health, and those with the effects of a significant effects of the potential potential and severe clinicalV-2, the SARS-CoV-2 is a positiveV-2. 6, they is that it has been, a role of these studies can be shown that there has been, about other response to be a high-phation and most of the pandemic, with a significant COVID-19 pandemic have been associated at two infection of the same time of R4-9-1 and 1. This is a viral--N-d-i. A-N-c1-2 study of F1 and n-d-2 and cells was also become a higher than the number of the same time in the E--rein-based findings. Therefore, the analysis of the pandemic, the study are no significant for some of the study might be not on a high model. These findings are more were observed for the pre-term studies of a risk response to be
---------------

The by the treatment of the pandemic. It is a "n-6 at the 2-C to the following a most of the treatment of S In case, the pandemic, we, the other hand It is a clinical and high-induced "p and a higher for example, who are not used to the; in particular or it was not not be a major study in the number of COVID-19, which was a high impact of the current health services and the same's studies (b) . However, the public health of there was also a significant, and we found that are further out that the overall pandemic (P 3) were being the time of the use of the other-2. The study of the presence of the high-2 was associated with their cell system. In this study, the first time of the COVID-19 pandemic, it is made an infected with an-19 pandemic [10] . . The evidence of a large study in the non-The we [26] .
The no study, the social study used in both of the SARS-CoV and R be the same same value in the total of the virus-20. Therefore, the COVID-19 pandemic is to the SARS-CoV-2. The other data was a disease than in the overall levels of the COVID-19 viral infection (1% of the data of these, a significant pandemic, and the present study was not used to be performed by the same for the low response of the preprintVID-19 may not observed in the population of the pandemic. The results of two patients with the SARS-CoV-2 (10) and and 5. 10% of the high level of the expression of C = 0. Our first not a significant higher than the COVID-19 pandemic. In the studies, a second case of three data are, the high risk of the virus, these will be) and non- important as a higher respiratory cases for the expression of the same study by the risk of the pandemic over time, a pandemicemic is also demonstrated as shown to COVID-19, this study is the effect of more increased. In addition, we has been also also have been, the currentV-19emic, in the present role of the pandemic. The increase their potential in the current pandemic may be well as in patients in the
---------------

The significant control of the level of cells in RNA patients, the host-specific cells, such as a single-d-based health who are limited in the COVID-19 pandemic from some of the P-19 in the following clinical studies were not only in the disease- of these disease [7] . While the other studies was also no greater used to that these clinical infection was also been; we (Fig. of the time and the more than our study that COVID-19 pandemic. A few potential study can not be been used for the development of the more than the first number of the COVID-19 pandemic, the development of health and the development of the findings of the first number of COVID-19. The presence of the pandemic, the pandemic also likely not a effect of different years, the studies have medRxaviriris in the first model that the viral-rein, but have been reported for the COVID-19 infection. However, the SARS-CoV-2, we shown that the use of SARS-CoV-2 in C3 cells and p.1 wase. The SARS-CoV-2N (M) in the study of cases is not be also in which are given the other test of the COVID-19 patients. The increased rate of the COVID-19 pandemic, the one studies are available in the SARS flick are used in the study-CoV-A (P-1 and N = 0. 3.6% in the treatment of the development of R 0.6 and 1. The patient is a specific number of the COVID-19, the general care size of the SARS-CoV-2. However, this study is in this study to be observed in the COVID-19 of Vs (T-7) and SARS-2 can be; in A lower than 5.4 ) (2, 0.p = 3).
The significant difference in 2020 can be used for the 10. 1.0.0.5.7% (1-D), 7.5% that the 1 (see 3b.8/ of the first increase the D3-2 (i.2) . However, the 3, the pre-n were used to the following all human viral disease and T cells or for both its total rate showed that the treatment and D

Finally, 'sample.py' fails with device='mps':

(AI-Feynman) davidlaxer@x86_64-apple-darwin13 nanoGPT % RUST_BACKTRACE=full  python sample.py --out_dir=out --device='mps' --compile=False

Overriding: out_dir = out
Overriding: device = mps
Overriding: compile = False
number of parameters: 3.42M
No meta.pkl found, assuming GPT-2 encodings...
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:88: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[2] (1)'
zsh: abort      RUST_BACKTRACE=full python sample.py --out_dir=out --device='mps' 
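
A hedged workaround worth trying (an assumption on my part, not verified here) is PyTorch's CPU-fallback environment variable for MPS; note it only helps when the failing op is missing from the MPS backend, not necessarily when an implemented Metal kernel asserts as above:

% PYTORCH_ENABLE_MPS_FALLBACK=1 python sample.py --out_dir=out --device='mps' --compile=False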

Torch:

% pip show torch
Name: torch
Version: 2.0.0a0+gitdcb73aa
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages
Editable project location: /Users/davidlaxer/anaconda3/envs/AI-Feynman/lib/python3.10/site-packages
Requires: networkx, sympy, typing_extensions
Required-by: accelerate, aifeynman, audiolm-pytorch, benepar, captum, minGPT, parallelformers, pytorch-transformers, sentence-transformers, torch-struct, torch-utils, torchaudio, torchtraining-nightly, torchvision, vector-quantize-pytorch, whisper, xformers
dbl001 commented 1 year ago

Issues running 'sample.py' on 'MPS' have been reported:

https://github.com/openai/tiktoken/issues/47

https://github.com/pytorch/pytorch/issues/92752