thu-coai / NAST

Codes for "NAST: A Non-Autoregressive Generator with Word Alignment for Unsupervised Text Style Transfer" (ACL 2021 findings)

Validation takes a very long time (10x) #7

Closed · thangld201 closed this issue 6 months ago

thangld201 commented 6 months ago

I noticed that validation was taking an unusually long time. To show this, I changed Lines 84-96 of NAST/styletransformer/seq2seq.py to the following:

import time  # added to time each phase

while self.now_epoch < args.epochs:
    self.now_epoch += 1
    self.updateOtherWeights()

    st_bo = time.time()
    self.train(args.eval_steps)
    logging.info("took %.2f minutes", (time.time() - st_bo) / 60) # <------------ measure train time

    st_bo = time.time()
    devloss_detail = self.evaluate("dev", hasref=False, write_output=False)
    self.devSummary(self.now_batch, devloss_detail)
    logging.info("epoch %d, evaluate dev", self.now_epoch)
    logging.info("took %.2f minutes", (time.time() - st_bo) / 60) # <------------ measure dev (validation) time

    st_bo = time.time()
    testloss_detail = self.evaluate("test", hasref=True, write_output=True)
    self.testSummary(self.now_batch, testloss_detail)
    logging.info("epoch %d, evaluate test", self.now_epoch)
    logging.info("took %.2f minutes", (time.time() - st_bo) / 60) # <------------ measure test time

And the logs show:

Training start......                                                                                                                 
train_0 set restart, 2769 batches and 2 left                                                                                         
train_1 set restart, 4156 batches and 57 left
[iter 505] d_adv_loss: 2.7750  f_slf_loss: 6.8572  f_cyc_loss: 9.1048  f_adv_loss: 1.7901  f_slf_length_loss: 0.0000  f_cyc_length_loss: 0.0000  temp: 1.0000  f_slf_gen_error: 0.0000 f_cyc_gen_error: 0.0000
...
15:04:14 seq2seq.py[line:90] took 1.05 minutes # training
dev_0 set restart, 31 batches and 16 left
dev_1 set restart, 31 batches and 16 left
15:14:47 seq2seq.py[line:98] epoch 1, evaluate dev
15:14:47 seq2seq.py[line:99] took 10.55 minutes # validation
test_0 set restart, 7 batches and 52 left
15:17:12 seq2seq.py[line:106] epoch 1, evaluate test                                                                                 
15:17:12 seq2seq.py[line:107] took 2.42 minutes # test

So validation alone runs about 10x slower than training. Do you have any idea why? @hzhwcmhf

hzhwcmhf commented 6 months ago

@thangld201 I am not sure, but here are some possible causes:

To rule out these effects, could you measure the time between these lines? https://github.com/thu-coai/NAST/blob/ef765d412f6e9a2ebdcc7d62c99ec2e883d0e17a/styletransformer/seq2seq.py#L313-L324
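
(For reference, a minimal sketch of how that span could be bracketed with a timer; time.perf_counter is used here and the names are placeholders, not code from the repo:)

import logging
import time

logging.basicConfig(level=logging.INFO)

start = time.perf_counter()
# ... the code between the referenced lines would run here ...
elapsed = time.perf_counter() - start
logging.info("measured section took %.2f minutes", elapsed / 60)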

thangld201 commented 6 months ago

@hzhwcmhf, I found that Line 334, predict_res.append(self.param.volatile.cls.predict_str(sents)), was the cause; it took up 99% of the time. I will dig a bit more... not sure why this is slow.

hzhwcmhf commented 6 months ago

> @hzhwcmhf, I found that Line 334, predict_res.append(self.param.volatile.cls.predict_str(sents)), was the cause; it took up 99% of the time. I will dig a bit more... not sure why this is slow.

It runs a classifier. Maybe check whether the classifier is on the GPU?
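
(A minimal sketch of one way to do that check, assuming the classifier wraps a standard torch.nn.Module; classifier_net below is a placeholder name, not the repo's attribute:)

import torch
import torch.nn as nn

classifier_net = nn.Linear(10, 2)  # placeholder for the classifier network

# All parameters of a module normally live on the same device,
# so inspecting the first one shows where the model will run.
print(next(classifier_net.parameters()).device)  # e.g. cpu or cuda:0

# Move the module to the GPU if one is available.
if torch.cuda.is_available():
    classifier_net = classifier_net.cuda()
    print(next(classifier_net.parameters()).device)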

thangld201 commented 6 months ago

> It runs a classifier. Maybe check whether the classifier is on the GPU?

Yeah, you are right: self.param.volatile.cls.net was on the CPU.

In NAST/styletransformer/run_cls.py, Lines 47-69:

L47: parser.add_argument('--cuda', action="store_true", help='Use cuda (gpu).')
...
L69: args.cuda = "cuda" if cargs.cuda else "cpu"

Since cargs is built from the arguments passed at Line 64 of NAST/styletransformer/main.py, i.e.

cls_param.args = run_cls.run("--dryrun", "--restore", args.clsrestore) # <----- the --cuda flag is not passed

the classifier defaults to the CPU. So I changed Line 64 to:

cls_param.args = run_cls.run("--cuda", "--dryrun", "--restore", args.clsrestore)
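
(For context, a minimal sketch of the pattern involved; the flag and device mapping mirror run_cls.py, while the network and argument list are placeholders:)

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
parser.add_argument('--cuda', action="store_true", help='Use cuda (gpu).')

# run_cls.run parses an explicit argument list; if "--cuda" is missing from it,
# store_true leaves cargs.cuda at False and the device silently falls back to CPU.
cargs = parser.parse_args([])               # mimics the original call without --cuda
device = "cuda" if cargs.cuda else "cpu"    # -> "cpu"

net = nn.Linear(10, 2)                      # placeholder for the restored classifier
net = net.to(device)                        # stays on the CPU, so every prediction runs on CPU

# Passing ["--cuda"] instead sets device to "cuda" and the classifier moves to the GPU.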

And it worked! Validation now takes less than half a minute (on Yelp)! Thank you!