soonilbae opened this issue 1 year ago
Did you find any solution to this?
Actually, I haven't found one yet. I guess I am too new. Could you please give me step-by-step instructions? I tried it with and without Docker, but the result was the same. My environment is 2 × 80GB A100 GPUs.
Thanks,
Soonil
@soonilbae did you run `python setup.py install`?
Just make a clean installation with all the dependencies. If you're using the latest releases of torch, make sure you mitigate the `torch._six` dependency by directly importing `six` for `string_classes`.
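(A minimal sketch of that kind of patch, assuming the failure is the `torch._six` module removed in recent torch releases; it would go wherever DeBERTa does `from torch._six import string_classes`:)

```python
# Fallback for newer torch releases that removed torch._six.
# six.string_types is just (str,) on Python 3, which is all string_classes was.
try:
    from torch._six import string_classes
except ImportError:
    from six import string_types as string_classes
```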
I actually got my pretraining running; currently running the base version with a batch_size of 96, as 256 suffers OOM errors.
@StephennFernandes have you managed to train successfully? I am training a Portuguese version. I am getting 67.5% validation accuracy and 1.55 validation loss after 70k steps (batch size 64), but when I import the discriminator weights into Huggingface, with the spm file for the tokenizer, the downstream classification tasks never converge: after 10 epochs the training and validation error don't decrease.
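(For context, a minimal sketch of the import step described above, assuming the discriminator weights were already converted into a Huggingface checkpoint directory; the paths, directory layout, and label count are illustrative, not from this thread:)

```python
from transformers import DebertaV2ForSequenceClassification, DebertaV2Tokenizer

# Illustrative layout: ./deberta-v3-pt contains config.json + pytorch_model.bin
# (the converted discriminator) plus the spm model trained for Portuguese.
tokenizer = DebertaV2Tokenizer("deberta-v3-pt/spm.model")
model = DebertaV2ForSequenceClassification.from_pretrained("deberta-v3-pt", num_labels=2)

inputs = tokenizer("um exemplo de frase", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2)
```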
@fmobrj I just ran the rtd.sh pretraining script to ensure everything is working fine; I didn't really focus on the training metrics.
However, gently pinging @BigBird01, as this has been a known issue for many folks.
Thanks!
@fmobrj hey, curious to know what hparams you used and how big your training data was; also, could I check your training metrics to better understand your problem?
No problem. I made some adjustments to rtd.sh and app.run, because I wanted to use the large pretrained English version as a starting point. It worked: my generator training loss started from 6.58 instead of 10.93 (purely from scratch), and the discriminator from 3.74 instead of 4.19 (scratch).
My dataset is a concatenation of ptwiki and BrWaC (https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC). After tokenization it ended up somewhere around 7M examples (of 512 tokens each).
First, I trained a Portuguese spm model with these params:
```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='data_debertav3/train_wiki_brwac.raw',
    model_prefix='/home/fmobrj/.~DeBERTa/assets/latest/deberta-v3-large-pt/spm',
    vocab_size=128000,
    character_coverage=1.0,
    model_type='unigram',
    input_sentence_size=7000000,
    unk_id=3,
    pad_id=0,
    bos_id=1,
    eos_id=2,
    unk_piece='[UNK]',
    pad_piece='[PAD]',
    bos_piece='[CLS]',
    eos_piece='[SEP]',
    user_defined_symbols=[])
```
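(Not part of the original recipe, but a quick sanity check of the trained model and its special-token ids could look like this, reusing the paths above:)

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(
    model_file='/home/fmobrj/.~DeBERTa/assets/latest/deberta-v3-large-pt/spm.model')
print(sp.get_piece_size())  # should report 128000
for piece in ['[PAD]', '[CLS]', '[SEP]', '[UNK]']:
    print(piece, sp.piece_to_id(piece))  # expected ids per the trainer args: 0, 1, 2, 3
print(sp.encode('uma frase em português', out_type=str))
```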
Then I tokenized the raw dataset myself instead of using prepare_data.py, because of memory constraints: the code in prepare_data.py loads all texts into memory and processes them in one pass, which is fine for wikitext103 but not for my (bigger) dataset. So I tokenized it myself, writing 510 tokens per line to leave room for the [CLS] and [SEP] specials within the 512-token max sequence length:
```python
from tqdm import tqdm

# `tokenizer` is the spm-based tokenizer built from the model trained above.
remaining_tokens = []
with open('data_debertav3/train_wiki_brwac.raw', encoding='utf-8') as fs:
    with open('deberta_v3_pt_tokenized/train.txt', 'w', encoding='utf-8') as wfs:
        for l in tqdm(fs, ncols=80, desc='Loading'):
            if len(l) > 0:
                tokens = tokenizer.tokenize(l)
            else:
                tokens = []
            remaining_tokens.extend(tokens)
            # Flush complete 510-token chunks, carrying the remainder to the next line
            while len(remaining_tokens) >= 510:
                wfs.write(' '.join(remaining_tokens[:510]) + '\n')
                remaining_tokens = remaining_tokens[510:]
```
Then I created an rtd_pt_continue.sh with some changes to the large-version part. I also used the original DeBERTa-v3-large English checkpoint downloaded from HF:
```bash
deberta-v3-large)
    parameters=" --num_train_epochs 1 \
    --model_config rtd_large.json \
    --warmup 500 \
    --learning_rate 1e-4 \
    --train_batch_size 64 \
    --accumulative_update 16 \
    --init_generator /media/hdd6tb/jupyter/notebooks/transformers/models_debertav3_large/pytorch_model.generator.bin \
    --init_discriminator /media/hdd6tb/jupyter/notebooks/transformers/models_debertav3_large/pytorch_model.bin \
    --workers 8 \
    --world_size -1 \
    --decoupled_training True \
    --fp16 True "
```
I also changed this part:
```bash
python -m DeBERTa.apps.run_continue --model_config config.json \
    --tag $tag \
    --do_train \
    --num_training_steps 1000000 \
    --max_seq_len $max_seq_length \
    --dump 1000 \
    --task_name $Task \
    --data_dir $data_dir \
    --vocab_path /home/fmobrj/.~DeBERTa/assets/latest/deberta-v3-large-pt/spm.model \
    --vocab_type spm \
    --output_dir /media/hdd6tb/pyinstalls/DeBERTa/debertav3_pt_continue_out_64_1epoch $parameters
```
In DeBERTa.apps, I created a run_continue.py to deal with the translation of the original English embeddings to the embeddings common to both languages, so I could reuse 33k of the 128k embeddings as a pretrained starting point. For this, I copied app.py as app_continue.py and made these changes to main:
I load the pretrained English models and the Portuguese tokenizer normally, as in the run.py code, but I also load the English tokenizer for copying weights. For this, I added:
```python
p, t = load_vocab(vocab_path=None, vocab_type='spm', pretrained_id='deberta-v3-large')
tokenizer_en = tokenizers[t](p)
```
Then I create a list with the Portuguese vocabulary:
```python
voc = []
for k, v in enumerate(tokenizer.vocab):
    voc.append(v)
```
After loading the weights of the pretrained English large models:
```python
import torch.nn as nn

tens_a = model.generator.deberta.embeddings.word_embeddings.weight
toks_len = len(tokenizer.vocab)

# Get weights of the old wte
old_wgts = model.generator.deberta.embeddings.word_embeddings.weight.clone().detach()
# Get the mean embedding vector of the old wte
wgts_m = old_wgts.mean(0)

# Initialize vocab size and weights of the new wte
new_vocab_size = 128100
new_wgts = old_wgts.clone().detach()

# Build the new wte, keeping the embedding vectors of tokens common to the 2 vocabs.
# A token present in the new vocab but not in the old one gets the mean embedding
# vector of the old wte.
old_vocab = tokenizer_en.vocab
new_vocab = tokenizer.vocab
same_tokens_list = list()
different_tokens_list = list()

for w, idx_new in new_vocab.items():
    idx_old = old_vocab.get(w, -1)
    if idx_old >= 0:
        print(idx_new)
        new_wgts[idx_new] = old_wgts[idx_old]
        same_tokens_list.append((w, idx_new))
    else:
        if idx_new <= 128000:
            new_wgts[idx_new] = wgts_m
            different_tokens_list.append((w, idx_new))

# Install the new wte in the model
new_wte = nn.Embedding(new_vocab_size, old_wgts.size(1))
# new_wte.weight.data.normal_(mean=0.0, std=model.config.initializer_range)
new_wte.weight.data = new_wgts
model.generator.deberta.embeddings.word_embeddings = new_wte
print(f'Portuguese wte matrix setup done!\n\nWe kept {len(same_tokens_list)} embedding vectors from the English one.\nWe did not keep {len(different_tokens_list)} embedding vectors from the English one (instead, we used the old wte mean vector).\n')

# Check identical tokens between the 2 vocabs
num = 15
print(f'{num} first tokens IN common between the 2 vocabs:\n{same_tokens_list[:num]}\n')
print(f'{num} first tokens NOT in common between the 2 vocabs:\n{different_tokens_list[:num]}')
```
After doing this, I can accelerate training because I reuse pretrained weights for the embeddings of the almost 33k tokens common to both tokenizers.
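(One caveat worth flagging: the snippet above only swaps the generator's word embeddings. If the discriminator needs the same treatment, a sketch of the analogous swap could look like the following; the `model.discriminator.deberta.embeddings.word_embeddings` attribute path is an assumption, so check how the RTD model names it in your checkout. It reuses `new_vocab`, `old_vocab`, and `new_vocab_size` from the snippet above.)

```python
import torch.nn as nn

# Assumption: the RTD discriminator exposes its word embeddings the same way
# the generator does above.
d_old = model.discriminator.deberta.embeddings.word_embeddings.weight.clone().detach()
d_new = d_old.clone().detach()
d_mean = d_old.mean(0)

for w, idx_new in new_vocab.items():
    idx_old = old_vocab.get(w, -1)
    if idx_old >= 0:
        d_new[idx_new] = d_old[idx_old]
    elif idx_new <= 128000:
        d_new[idx_new] = d_mean

d_wte = nn.Embedding(new_vocab_size, d_old.size(1))
d_wte.weight.data = d_new
model.discriminator.deberta.embeddings.word_embeddings = d_wte
```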
@fmobrj hey, I'm glad things started working for you.
And thanks a ton for sharing the implementation tweaks that got you up and running; I'm sure someone in the community will highly benefit from this.
Sure. Send me a DM.
My training metrics up to now:
```
04/28/2023 10:08:52|INFO|RTD|00| device=cuda, n_gpu=1, distributed training=False, world_size=1
04/28/2023 10:09:01|INFO|RTD|00| Training batch size = 64
04/28/2023 10:09:01|INFO|RTD|00| Num steps = 1000000
04/28/2023 10:19:56|INFO|RTD|00| [D][0.0%][-1813.71h] Steps=100, loss=3.7474557736516, examples=6400, loss_scale=4096.0, 653.0s
04/28/2023 10:19:56|INFO|RTD|00| [G][0.0%][-1802.80h] Steps=100, loss=6.583027583360672, examples=6400, loss_scale=4096.0, 649.1s
...
04/28/2023 11:53:25|INFO|RTD|00| [D][0.1%][-1731.59h] Steps=1000, loss=3.231295110538602, examples=64000, loss_scale=32768.0, 624.0s
04/28/2023 11:53:28|INFO|RTD|00| Best metric: 0@1000
04/28/2023 11:53:28|INFO|RTD|00| [G][0.1%][-1739.89h] Steps=1000, loss=4.521974143728614, examples=64000, loss_scale=32768.0, 627.0s
04/28/2023 11:58:44|INFO|RTD|00| Eval results-dev-001000-1000000
04/28/2023 11:58:44|INFO|RTD|00| accuracy = 0.5296928353948044
04/28/2023 11:58:44|INFO|RTD|00| eval_loss = 3.0249011516571045
04/28/2023 11:58:44|INFO|RTD|00| eval_metric = 0.5296928353948044
04/28/2023 11:58:44|INFO|RTD|00| eval_samples = 1816583
04/28/2023 11:58:44|INFO|RTD|00| perplexity = 20.591968536376953
04/28/2023 11:58:44|INFO|RTD|00| Best metric: 0.5296928353948044@1000
...
04/28/2023 13:43:09|INFO|RTD|00| [D][0.2%][-1743.93h] Steps=2000, loss=3.0459041997492315, examples=128000, loss_scale=32768.0, 629.1s
04/28/2023 13:43:10|INFO|RTD|00| Best metric: 0@1000
04/28/2023 13:43:10|INFO|RTD|00| [G][0.2%][-1747.23h] Steps=2000, loss=3.6884158945083616, examples=128000, loss_scale=65536.0, 630.3s
04/28/2023 13:48:27|INFO|RTD|00| Eval results-dev-002000-1000000
04/28/2023 13:48:27|INFO|RTD|00| accuracy = 0.5810821746102435
04/28/2023 13:48:27|INFO|RTD|00| eval_loss = 2.4255385398864746
04/28/2023 13:48:27|INFO|RTD|00| eval_metric = 0.5810821746102435
04/28/2023 13:48:27|INFO|RTD|00| eval_samples = 1816583
04/28/2023 13:48:27|INFO|RTD|00| perplexity = 11.308318138122559
04/28/2023 13:48:27|INFO|RTD|00| Best metric: 0.5810821746102435@2000
...
04/28/2023 15:32:39|INFO|RTD|00| [D][0.3%][-1745.65h] Steps=3000, loss=2.941771824300289, examples=192000, loss_scale=131072.0, 630.3s
04/28/2023 15:32:40|INFO|RTD|00| Best metric: 0@1000
04/28/2023 15:32:40|INFO|RTD|00| [G][0.3%][-1748.89h] Steps=3000, loss=3.2808233415335417, examples=192000, loss_scale=131072.0, 631.5s
04/28/2023 15:37:57|INFO|RTD|00| Eval results-dev-003000-1000000
04/28/2023 15:37:57|INFO|RTD|00| accuracy = 0.6014726549791559
04/28/2023 15:37:57|INFO|RTD|00| eval_loss = 2.199498414993286
04/28/2023 15:37:57|INFO|RTD|00| eval_metric = 0.6014726549791559
04/28/2023 15:37:57|INFO|RTD|00| eval_samples = 1816583
04/28/2023 15:37:57|INFO|RTD|00| perplexity = 9.020487785339355
04/28/2023 15:37:57|INFO|RTD|00| Best metric: 0.6014726549791559@3000
...
04/28/2023 17:22:22|INFO|RTD|00| [D][0.4%][-1728.36h] Steps=4000, loss=2.8731452637128534, examples=256000, loss_scale=131072.0, 624.7s
04/28/2023 17:22:23|INFO|RTD|00| Best metric: 0@1000
04/28/2023 17:22:23|INFO|RTD|00| [G][0.4%][-1731.66h] Steps=4000, loss=3.0323750951420516, examples=256000, loss_scale=131072.0, 625.9s
04/28/2023 17:27:40|INFO|RTD|00| Eval results-dev-004000-1000000
04/28/2023 17:27:40|INFO|RTD|00| accuracy = 0.6137627622850154
04/28/2023 17:27:40|INFO|RTD|00| eval_loss = 2.0676419734954834
04/28/2023 17:27:40|INFO|RTD|00| eval_metric = 0.6137627622850154
04/28/2023 17:27:40|INFO|RTD|00| eval_samples = 1816583
04/28/2023 17:27:40|INFO|RTD|00| perplexity = 7.906157970428467
04/28/2023 17:27:40|INFO|RTD|00| Best metric: 0.6137627622850154@4000
...
04/28/2023 19:12:15|INFO|RTD|00| [D][0.5%][-1745.52h] Steps=5000, loss=2.822846501916647, examples=320000, loss_scale=131072.0, 631.5s
04/28/2023 19:12:16|INFO|RTD|00| Best metric: 0@1000
04/28/2023 19:12:16|INFO|RTD|00| [G][0.5%][-1748.81h] Steps=5000, loss=2.8615684445723892, examples=320000, loss_scale=131072.0, 632.7s
04/28/2023 19:17:34|INFO|RTD|00| Eval results-dev-005000-1000000
04/28/2023 19:17:34|INFO|RTD|00| accuracy = 0.6218273538836375
04/28/2023 19:17:34|INFO|RTD|00| eval_loss = 1.9890456199645996
04/28/2023 19:17:34|INFO|RTD|00| eval_metric = 0.6218273538836375
04/28/2023 19:17:34|INFO|RTD|00| eval_samples = 1816583
04/28/2023 19:17:34|INFO|RTD|00| perplexity = 7.308555603027344
04/28/2023 19:17:34|INFO|RTD|00| Best metric: 0.6218273538836375@5000
...
05/04/2023 09:20:31|INFO|RTD|00| Eval results-dev-078000-1000000
05/04/2023 09:20:31|INFO|RTD|00| accuracy = 0.6778170884567344
05/04/2023 09:20:31|INFO|RTD|00| eval_loss = 1.5253663063049316
05/04/2023 09:20:31|INFO|RTD|00| eval_metric = 0.6778170884567344
05/04/2023 09:20:31|INFO|RTD|00| eval_samples = 1816583
05/04/2023 09:20:31|INFO|RTD|00| perplexity = 4.596827507019043
05/04/2023 09:20:31|INFO|RTD|00| Best metric: 0.6780312267592508@77000
...
05/04/2023 11:04:55|INFO|RTD|00| [D][7.9%][-1607.53h] Steps=79000, loss=2.4240818894606413, examples=5056000, loss_scale=524288.0, 628.3s
05/04/2023 11:04:56|INFO|RTD|00| Best metric: 0@1000
05/04/2023 11:04:56|INFO|RTD|00| [G][7.9%][-1610.53h] Steps=79000, loss=1.7945597836827458, examples=5056000, loss_scale=262144.0, 629.5s
05/04/2023 11:10:13|INFO|RTD|00| Eval results-dev-079000-1000000
05/04/2023 11:10:13|INFO|RTD|00| accuracy = 0.6774967067290621
05/04/2023 11:10:13|INFO|RTD|00| eval_loss = 1.5252898931503296
05/04/2023 11:10:13|INFO|RTD|00| eval_metric = 0.6774967067290621
05/04/2023 11:10:13|INFO|RTD|00| eval_samples = 1816583
05/04/2023 11:10:13|INFO|RTD|00| perplexity = 4.596476078033447
05/04/2023 11:10:13|INFO|RTD|00| Best metric: 0.6780312267592508@77000
...
```
The discriminator seems to be learning nothing: after 79k steps, at each 1k dump the discriminator result is still "Best metric: 0@1000", despite the generator clearly learning, with decreasing loss and increasing accuracy.
Hi, @BigBird01! Is it expected that during training the metric for the discriminator stays at "Best metric: 0@1000" after many steps (currently it is at 5811200 examples and 90k steps)? The generator is improving in accuracy (~0.68) and loss (1.51).
@fmobrj did you get to try your checkpoints in any downstream task to see if the training is working?
Hi, @pvcastro! I tried, but it is not converging when applying the model to a classification task in Portuguese that works even with the English pretrained model in Huggingface. I suspect the discriminator is not well trained enough. I stopped pretraining with a G loss of 1.28 and 71.6 accuracy, but the D validation report still shows 0@250 after almost 200k steps with a batch size of 64.
I've set up the environment according to the instructions, and tried pretraining the rtd model as follows:

```bash
# in DeBERTa/experiments/language_model/
bash rtd.sh deberta-v3-base
```

But I got the following error message:
```
Traceback (most recent call last):
  File "./prepare_data.py", line 37, in <module>
    tokenize_data(args.input, args.output, args.max_seq_length)
  File "./prepare_data.py", line 9, in tokenize_data
    tokenizer = deberta.tokenizers[t](p)
  File "/usr/local/lib/python3.6/dist-packages/DeBERTa/deberta/spm_tokenizer.py", line 29, in __init__
    assert os.path.exists(vocab_file)
  File "/usr/lib/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
```
Is it because I did not specify the vocab file? Or did I miss something in setting up the environment?
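(A note on the traceback above: the assert fires because the tokenizer's vocab path resolved to None, i.e. prepare_data.py could not find the spm vocab in the pretrained-assets cache. A minimal way to check whether the vocab download/lookup works, mirroring the load_vocab call used earlier in this thread — the pretrained id here is a guess for the base model:)

```python
from DeBERTa import deberta

# p should be a real path to spm.model in the local assets cache;
# if it prints None, the vocab was never downloaded/resolved.
p, t = deberta.load_vocab(vocab_path=None, vocab_type='spm', pretrained_id='deberta-v3-base')
print(p, t)
tokenizer = deberta.tokenizers[t](p)
```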