pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

TPU training result is much worse than CPU #1326

Closed: crystina-z closed this issue 4 years ago

crystina-z commented 4 years ago

❓ Questions and Help

I'm trying to run the repo on a single TPU core and I've gotten the model to run at a pretty reasonable speed. However, it seems to me that the results trained on TPU are quite a bit worse than on CPU:

training with CPU: image

training with TPU: image

I put my code here; the master branch has the code for TPU training and the cpu branch has the code for CPU training, but basically the only changes are the device in train.py, data.py, and modeling.py.

Since the dataset is not published, it's not directly reproducible, but are there any suggestions about possible reasons for this inconsistency?

ailzhang commented 4 years ago

Hey @Crysitna, thanks for reporting! Does inference produce the same result with both models, by the way? Since the dataset is private, it would help a lot if you could give us some sample fake data that shows CPU and TPU producing different results; that would help us track down the root cause and fix it :D

crystina-z commented 4 years ago

@ailzhang hey, thanks for the comment! I will run the inference and try to get a sample dataset.

crystina-z commented 4 years ago

@ailzhang hey, I added the sample docs and shell scripts to run the files here. To get the result, you can run ./sample_run/xla_training_vanilla.sh, then ./sample_run/xla_inference.sh, then ./sample_run/eval.sh.

You might need to change the device in each .py file (data.py, modeling.py, train.py).

For the inference result I'm not quite sure. I tried to load the model from the trained weight files, but the results seem to differ depending on the map_location passed to model.load().

e.g. if the weight file was saved from the XLA model, then

model_1 = model.load(args.model_weights.name, map_location='cpu').to('xla')
model_2 = model.load(args.model_weights.name, map_location='xla')

would generate different inference results, even though both models are run on XLA. Yet

model_1 = model.load(args.model_weights.name, map_location='cpu').to('xla')
model_2 = model.load(args.model_weights.name, map_location='cpu')

would give the same results, even though one inference is done on XLA while the other is done on CPU.

Hope the information helps.

zcain117 commented 4 years ago

Thanks for putting together the sample script @Crysitna! I'll try training the model.

Just to confirm, were you running on the nightly pytorch/xla build? And what was your method of using pytorch/xla: was it from docker run or did you use e.g. conda activate torch-xla-nightly or did you build pytorch/xla manually?

crystina-z commented 4 years ago

Oh, the machine is xla-nightly, and I tried both the torch-xla-nightly and torch-xla-0.5 environments, but it doesn't make a difference.

zcain117 commented 4 years ago

Attachments: p20 (image), nohup.txt (log).

Train loss seems to decrease quickly but validation doesn't look right.

The graphs you posted - were those from run_model or a different way of evaluating?

zcain117 commented 4 years ago

device in modeling.py: xla:1
device in data.py: xla:1
device in training.py: xla:1
train epoch=0 loss=15.407709121704102
validation epoch=0 score=0.05
new top validation score, saving weights
train epoch=1 loss=0.002857685089111328
validation epoch=1 score=0.05
train epoch=2 loss=0.0019013285636901855
validation epoch=2 score=0.05
train epoch=3 loss=0.0016412138938903809
validation epoch=3 score=0.05
train epoch=4 loss=0.001498878002166748
validation epoch=4 score=0.05
train epoch=5 loss=0.0013256072998046875
validation epoch=5 score=0.05
train epoch=6 loss=0.001251518726348877
validation epoch=6 score=0.05
train epoch=7 loss=0.0011090636253356934
validation epoch=7 score=0.05
train epoch=8 loss=0.000981152057647705
validation epoch=8 score=0.05
train epoch=9 loss=0.0008842945098876953
validation epoch=9 score=0.05
train epoch=10 loss=0.0008647441864013672
validation epoch=10 score=0.05
train epoch=11 loss=0.0007768869400024414
validation epoch=11 score=0.05
train epoch=12 loss=0.0007230043411254883
validation epoch=12 score=0.05
train epoch=13 loss=0.0006778836250305176
validation epoch=13 score=0.05
train epoch=14 loss=0.000605165958404541
validation epoch=14 score=0.05
train epoch=15 loss=0.0005647540092468262
validation epoch=15 score=0.05
train epoch=16 loss=0.0005486607551574707
validation epoch=16 score=0.05
train epoch=17 loss=0.0005112886428833008
validation epoch=17 score=0.05
train epoch=18 loss=0.0004782676696777344
validation epoch=18 score=0.05
train epoch=19 loss=0.0004400014877319336
validation epoch=19 score=0.05
train epoch=20 loss=0.00043064355850219727
validation epoch=20 score=0.05
train epoch=21 loss=0.00041866302490234375
validation epoch=21 score=0.05
train epoch=22 loss=0.00037992000579833984
validation epoch=22 score=0.05
train epoch=23 loss=0.0003535747528076172
validation epoch=23 score=0.05
train epoch=24 loss=0.00035500526428222656
validation epoch=24 score=0.05
train epoch=25 loss=0.00033724308013916016
validation epoch=25 score=0.05
train epoch=26 loss=0.00031507015228271484
validation epoch=26 score=0.05
train epoch=27 loss=0.00029736757278442383
validation epoch=27 score=0.05
train epoch=28 loss=0.00028514862060546875
validation epoch=28 score=0.05
train epoch=29 loss=0.0002726316452026367
validation epoch=29 score=0.05
saved validation scores into p20.png

zcain117 commented 4 years ago

Also: does your training loss look similar on CPU vs TPU?

crystina-z commented 4 years ago

@zcain117 Oh, sorry for missing these messages.

It makes sense to me that the validation value doesn't change on the sample dataset: as long as the ranking of the items in the validation set stays the same (their relative scores are the same), the validation score stays the same.

But it's true that the sample dataset can only be used to run the model, not to exactly reproduce the error. Still, its results on CPU and TPU really are different, as mentioned here:

it seems the results are different depending on the map_location of model.load()

e.g. if the weight file was saved from the XLA model, then

model_1 = model.load(args.model_weights.name, map_location='cpu').to('xla')
model_2 = model.load(args.model_weights.name, map_location='xla')

would generate different inference results, even though both models are run on XLA. Yet

model_1 = model.load(args.model_weights.name, map_location='cpu').to('xla')
model_2 = model.load(args.model_weights.name, map_location='cpu')

I plotted the loss obtained on the full dataset for 10 epochs and they are not exactly the same. Loss on TPU: image

Loss on CPU: image

zcain117 commented 4 years ago

If you are sending any tensors to TPU during training, it's possible that this fix will help: https://github.com/pytorch/xla/pull/1411

Without that fix, we were sometimes dropping the requires_grad flag on tensors when moving tensors to XLA device. That fix will appear in tomorrow's nightly build.
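
A quick way to sanity-check that behavior, as a minimal sketch (this assumes a TPU-enabled environment with torch_xla installed and is not taken from the repo in question):

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()

    # A tensor that requires gradients on CPU...
    x = torch.randn(4, requires_grad=True)

    # ...should still report requires_grad=True after being moved to the XLA
    # device. Before the fix in pytorch/xla#1411 this flag could be dropped.
    x_xla = x.to(device)
    print(x_xla.requires_grad)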

I don't think it would affect inference though. On the real dataset, does the TPU training loss reach a similar number to the CPU training loss?

You were also saying that inference is different depending on device even using the same model weights. Have you tried running a single example through the CPU inference vs the TPU inference to see the prediction tensor? I'm wondering if TPU is just predicting some kind of deterministic value. I see the "p20" score is about 0.21 for TPU inference: does that correspond to guessing the same value over and over?

zcain117 commented 4 years ago

I also wanted to clarify: you said you load the model and run inference and output is different on TPU vs CPU. You gave the example of:

model_1 = model.load(args.model_weights.name, map_location='cpu').to('xla')

Is this different than torch.load() or model.load_state_dict()? https://pytorch.org/tutorials/beginner/saving_loading_models.html
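
For reference, the checkpoint-loading pattern from that tutorial would look roughly like this (a sketch only; the Linear placeholder model and the 'weights.pt' path are not identifiers from the reporter's repo):

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()

    # Placeholder module; in practice, construct the model the same way as in training.
    model = torch.nn.Linear(4, 2)

    # Load the checkpoint onto CPU first, then move the module to the XLA device.
    state_dict = torch.load('weights.pt', map_location='cpu')
    model.load_state_dict(state_dict)
    model = model.to(device)
    model.eval()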

Glorf commented 4 years ago

Hi, thanks for filing this bug. I'm currently encountering the very same issue while fine-tuning huggingface transformers' gpt2 on TPU. My code is based on AllenAI's tpu_pretrain (https://github.com/allenai/tpu_pretrain/blob/master/pretrain.py). The loss decreases faster and validation achieves better results, but after the model is saved and inference is run, the actual inference performance is a complete disaster.

I'm running the latest pytorch-xla-nightly, with the mentioned bugfix, using the XLA multiprocessing approach.

zcain117 commented 4 years ago

@Glorf you mention "loss decreases faster, the validation achieves better results": is this comparing loss/validation training on TPU vs training on CPU?

When you say "the inference performance is a complete disaster", I am assuming that inference on TPU is worse than inference on CPU?

Also, how are you loading the model to perform the inference? Any way that we could repro this would be very helpful

Glorf commented 4 years ago

Thanks for your response. I first load the pretrained model using huggingface's from_pretrained interface. Then I pass the model to XLA using .to(device). The model seems to train properly (I believe it does, as the validation shows decent perplexity). I save the model using xm.save(...).

And that's where the problem begins: when I reload the model on CPU using from_pretrained with "normal" PyTorch, it turns out to work much worse. I suppose the problem resides somewhere in the saving/loading stage, as I have checked my whole pipeline and cannot see any other issue.
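
For what it's worth, a typical save/reload round trip with xm.save looks roughly like this (a sketch under the assumption that only the state dict needs to be checkpointed; the Linear placeholder and the 'checkpoint.pt' file name are illustrative, not from the code above):

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()
    model = torch.nn.Linear(4, 2).to(device)  # placeholder for the real model

    # xm.save moves XLA tensors to CPU before serializing, so the file can be
    # read later with plain torch.load on any machine.
    xm.save(model.state_dict(), 'checkpoint.pt')

    # Reload on CPU for inference.
    state_dict = torch.load('checkpoint.pt', map_location='cpu')
    cpu_model = torch.nn.Linear(4, 2)
    cpu_model.load_state_dict(state_dict)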

However, it's hard to compare with the GPU training code I used before, as I had to rewrite it completely to run on multiprocessing. I cannot share the code right now, unfortunately, but I'm eager to provide any additional details that may help.

The important thing I just discovered is that the model trains properly when using XLA_USE_BF16 (it shows worse validation results during training, but runs as expected after saving).

RESULT: the bf16 training works well enough for me right now, but it might be helpful to keep this issue in mind.
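
For anyone following along, XLA_USE_BF16 is an environment variable read by torch_xla, so it typically needs to be set before torch_xla is initialized; a minimal sketch of doing that from Python (the training code itself is omitted):

    import os

    # Enable automatic bfloat16 on TPU; set this before importing torch_xla,
    # otherwise it may not take effect.
    os.environ['XLA_USE_BF16'] = '1'

    import torch_xla.core.xla_model as xm

    device = xm.xla_device()
    # float32 tensors will now be stored as bfloat16 on the TPU, which also
    # affects the precision of metrics computed on-device (e.g. validation).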

jysohn23 commented 4 years ago

Hello @Glorf, awesome to hear you've been working with HF + PyTorch + TPUs. We're actually also working on making this officially supported in HF, and we eventually plan to upstream our runners.

I've been able to get from_pretrained + checkpointing working to the extent that we can checkpoint models once they're fully fine-tuned and then subsequently load them back from the checkpoint and run eval. I've been able to get good accuracy, F1, etc. numbers on GLUE datasets.

Please feel free to check this runner example out: https://github.com/pytorch-tpu/transformers/blob/tpu/examples/run_glue_tpu.py.

These are the PRs that were needed.

ailzhang commented 4 years ago

W.r.t. XLA_USE_BF16, maybe we could send out a warning about the validation result precision when we detect that BF16 is on? What do you think @dlibenzi @jysohn23 @taylanbil ?

dlibenzi commented 4 years ago

A warning ... from where? We do log when BF16 is enabled already.

dlibenzi commented 4 years ago

@Glorf Is this single or multi-core?

Glorf commented 4 years ago

The training is XLA multiprocessing on 8 CPU cores, with TPUv3-8 as the backend. I'm still working on this, trying to better understand what happens. I'll keep you updated.

dlibenzi commented 4 years ago

You mean 8 TPU cores, right? When debugging, it is better to use a single core (pass nprocs=1 to xmp.spawn()).
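
For reference, a single-core debug run might look like this (a minimal sketch; train_loop is a placeholder for the actual per-process training function):

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.xla_multiprocessing as xmp

    def train_loop(index):
        device = xm.xla_device()
        # ... build the model, move it to `device`, and train as usual ...

    if __name__ == '__main__':
        # nprocs=1 runs everything on a single TPU core, which makes debugging
        # much easier than the full 8-core multiprocessing setup.
        xmp.spawn(train_loop, args=(), nprocs=1)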

dlibenzi commented 4 years ago

Also @Glorf , about this:

However, It's hard to compare with code on GPU training that I used before, as I had to rewrite it completely to run on multiprocessing.

We would like to hear what kind of big changes you had to go through to port GPU multiprocessing to TPU multiprocessing, as it might serve us as a guideline and drive API refinements. Thanks!

Glorf commented 4 years ago

@dlibenzi The thing is, my GPU training code didn't use multiprocessing at all. I used it here after reading that it may give training a significant performance boost. Also, in my GPU code I used NVIDIA Apex mixed precision, which was easy to remove, but it makes comparing overall model performance during training trickier. Finally, my fault: I didn't know about pytorch-tpu/transformers (thanks @jysohn23 for the link), so I implemented the necessary model loading locks etc. myself. My next step will be to use your implementation and see whether I implemented something incorrectly.

jysohn23 commented 4 years ago

I've only implemented the GLUE task bits for a couple of models, but the next task will be LM fine-tuning, including GPT-2. As for synchronization and locking, there seems to be a bug where PyTorch doesn't work well with Python multiprocessing: https://github.com/pytorch/pytorch/issues/29855. But hopefully we should be able to use things like Python barriers once that's fixed.
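
One primitive torch_xla already provides for this kind of cross-process synchronization is xm.rendezvous; below is a minimal sketch of the "only one process loads, the others wait" pattern (load_pretrained_model is a placeholder, and this assumes the function is launched via xmp.spawn):

    import torch
    import torch_xla.core.xla_model as xm

    def load_pretrained_model():
        # Placeholder for an expensive load, e.g. a from_pretrained(...) call.
        return torch.nn.Linear(4, 2)

    def _mp_fn(index):
        # Every process except the first blocks here; rendezvous only returns
        # once all processes have reached the same tag.
        if index != 0:
            xm.rendezvous('load_pretrained_model')

        model = load_pretrained_model()

        # Process 0 reaches the rendezvous only after loading, releasing the rest.
        if index == 0:
            xm.rendezvous('load_pretrained_model')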

dlibenzi commented 4 years ago

@Glorf Thanks, great to know! 👍

zcain117 commented 4 years ago

Marking as closed for now. If anyone has specific questions or bugs, feel free to open a new issue with the details