pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Big difference between pytorch xla and pytorch on cpu #4170

Open houghtonweihu opened 1 year ago

houghtonweihu commented 1 year ago

I am training a reinforcement learning model. The same code produces very different outcomes on TPU and CPU. I trained each model for 100 steps; below are the last few steps. First, TPU (XLA). The columns are: step, user id, max reward, actual reward:

step  user  max_reward  actual_reward
88    0     92.0        23.0
88    1     58.0         9.0
88    2     34.0         3.0
88    3     33.0        10.0
88    4     35.0         4.0
89    0     92.0        22.0
89    1     58.0         6.0
89    2     34.0         2.0
89    3     33.0         5.0
89    4     35.0         5.0
90    0     92.0        17.0
90    1     58.0         4.0
90    2     34.0         2.0
90    3     33.0         3.0
90    4     35.0         3.0
91    0     92.0        13.0
91    1     58.0         4.0
91    2     34.0         4.0
91    3     33.0         3.0
91    4     35.0         3.0
92    0     92.0        11.0
92    1     58.0         1.0
92    2     34.0         3.0
92    3     33.0         2.0
92    4     35.0         2.0
93    0     92.0         8.0
93    1     58.0         5.0
93    2     34.0         3.0
93    3     33.0         2.0
93    4     35.0         1.0
94    0     92.0         8.0
94    1     58.0         0.0
94    2     34.0         1.0
94    3     33.0         5.0
94    4     35.0         1.0
95    0     92.0         6.0
95    1     58.0         4.0
95    2     34.0         2.0
95    3     33.0         2.0
95    4     35.0         3.0
96    0     92.0         8.0
96    1     58.0         3.0
96    2     34.0         4.0
96    3     33.0         4.0
96    4     35.0         4.0
97    0     92.0         4.0
97    1     58.0         7.0
97    2     34.0         2.0
97    3     33.0         3.0
97    4     35.0         1.0
98    0     92.0         6.0
98    1     58.0         7.0
98    2     34.0         2.0
98    3     33.0         1.0
98    4     35.0         3.0
99    0     92.0         2.0
99    1     58.0         4.0
99    2     34.0         2.0
99    3     33.0         1.0
99    4     35.0         1.0

This is not learning at all on TPU with XLA.

Now the same code on CPU:

step  user  max_reward  actual_reward
88    0     92.0        88.0
88    1     58.0        54.0
88    2     34.0        30.0
88    3     33.0        29.0
88    4     35.0        32.0
89    0     92.0        86.0
89    1     58.0        54.0
89    2     34.0        32.0
89    3     33.0        31.0
89    4     35.0        32.0
90    0     92.0        84.0
90    1     58.0        57.0
90    2     34.0        31.0
90    3     33.0        30.0
90    4     35.0        34.0
91    0     92.0        86.0
91    1     58.0        54.0
91    2     34.0        32.0
91    3     33.0        31.0
91    4     35.0        34.0
92    0     92.0        86.0
92    1     58.0        56.0
92    2     34.0        32.0
92    3     33.0        31.0
92    4     35.0        33.0
93    0     92.0        89.0
93    1     58.0        52.0
93    2     34.0        31.0
93    3     33.0        31.0
93    4     35.0        35.0
94    0     92.0        87.0
94    1     58.0        58.0
94    2     34.0        32.0
94    3     33.0        29.0
94    4     35.0        34.0
95    0     92.0        87.0
95    1     58.0        54.0
95    2     34.0        32.0
95    3     33.0        31.0
95    4     35.0        32.0
96    0     92.0        86.0
96    1     58.0        55.0
96    2     34.0        30.0
96    3     33.0        29.0
96    4     35.0        32.0
97    0     92.0        89.0
97    1     58.0        52.0
97    2     34.0        33.0
97    3     33.0        31.0
97    4     35.0        34.0
98    0     92.0        87.0
98    1     58.0        50.0
98    2     34.0        32.0
98    3     33.0        32.0
98    4     35.0        32.0
99    0     92.0        90.0
99    1     58.0        52.0
99    2     34.0        32.0
99    3     33.0        32.0
99    4     35.0        34.0

This shows that, with the same code, the model is learning and converging on CPU.

JackCaoG commented 1 year ago

Yeah, it doesn't seem like it is learning. I don't think it is a TPU issue, but maybe something in the model code. Did you have a chance to look at https://github.com/pytorch/xla/blob/master/API_GUIDE.md#running-on-a-single-xla-device (I assume you train on a single device)?

If you can give a minimal repro, the debugging will be easier.
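A minimal sketch of the kind of repro that helps here (the tiny linear model and `train_losses` helper are hypothetical, not from the original code): run the same seeded training loop on CPU and record per-step losses, then point the same loop at `xm.xla_device()` and diff the two traces to localize where they diverge. The `xm` calls from torch_xla are shown only in comments so the snippet runs on plain PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_losses(device, steps=20, seed=0):
    """Run a tiny deterministic training loop and return per-step losses."""
    torch.manual_seed(seed)  # identical init and data on every backend
    model = nn.Linear(4, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(64, 4)
    y = x.sum(dim=1, keepdim=True)  # a target a linear model can learn
    losses = []
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(model(x.to(device)), y.to(device))
        loss.backward()
        opt.step()
        # On XLA, cut the lazy graph here with xm.mark_step(),
        # or replace opt.step() with xm.optimizer_step(opt).
        losses.append(loss.item())
    return losses

cpu_losses = train_losses("cpu")
print(cpu_losses[0], cpu_losses[-1])  # loss should drop if the loop learns
# With torch_xla installed, the same loop targets the TPU:
#   import torch_xla.core.xla_model as xm
#   tpu_losses = train_losses(xm.xla_device())
# The first step at which tpu_losses diverges from cpu_losses points at the bug.
```

A flat loss curve on XLA but not on CPU often means the optimizer step is never materialized (e.g. a missing `xm.mark_step()` / `xm.optimizer_step()`), which would match the "not learning at all" symptom above.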