loss_std resulting in complex number and breaking Tensorboard

fernandocamargoai commented 6 years ago

I'm using torchbearer with PyTorch 0.4 and TensorboardX 1.2. Previously, I was using PyTorch 0.4.1, but I had to downgrade to use TensorboardX because of an incompatibility between the two. After adding the Tensorboard callback, the following error is raised after training for some time:

TypeError: can't convert complex to float

When debugging, I noticed that add_scalar() in TensorboardX tried to convert the scalar to float and, somehow, val_loss_std was a complex number. Is there an error in how the std is calculated that could produce a complex number?
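Since the failure comes from a float() conversion on the logging side, one caller-side guard is to screen metric values before they are handed to the writer. The following is only a minimal sketch: the helper name as_plottable is hypothetical and is not part of torchbearer or TensorboardX.

```python
def as_plottable(value):
    """Return value as a float, or None if it cannot be plotted as a real scalar.

    Hypothetical helper for illustration; not a torchbearer or TensorboardX API.
    """
    if isinstance(value, complex):
        # float(complex) raises TypeError, so report "not plottable" instead.
        return None
    return float(value)


# Example: keep only the metrics that can safely be logged as real scalars.
metrics = {"val_loss": 0.314, "val_loss_std": complex(3.43e-21, 5.61e-05)}
loggable = {}
for name, value in metrics.items():
    converted = as_plottable(value)
    if converted is not None:
        loggable[name] = converted
```

With a guard like this, a complex metric would be skipped rather than crashing the run, at the cost of silently hiding the underlying numerical problem.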

MattPainter01 commented 6 years ago

@fernandocamargoti Thanks for the issue.

I had a look at the std calculation and I can see that it would go complex if the losses are extremely small. For example, in my testing I get a complex number if I calculate std([1e-30, 1e-30]). I think PyTorch float tensors default to single precision, so this is probably some underflow error.
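To illustrate how such an underflow could lead to a complex result, here is a minimal sketch. It assumes the std is computed as sqrt(E[x^2] - E[x]^2) with the squares accumulated in single precision; the thread does not confirm that this is how torchbearer computes it. The names to_float32 and naive_std are hypothetical, and single precision is simulated with the standard library rather than PyTorch.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_std(values):
    """Std via sqrt(E[x^2] - E[x]^2), with squares kept in simulated float32.

    Assumption for illustration only; not torchbearer's actual code.
    """
    total = 0.0
    sum_squares = 0.0
    for v in values:
        v32 = to_float32(v)
        total += v32
        # 1e-30 squared is about 1e-60, far below the smallest float32, so it becomes 0.0.
        sum_squares = to_float32(sum_squares + to_float32(v32 * v32))
    mean = total / len(values)
    variance = sum_squares / len(values) - mean ** 2  # slightly negative here
    # In Python 3, a negative float raised to the power 0.5 yields a complex number.
    return variance ** 0.5


result = naive_std([1e-30, 1e-30])  # a complex value, not a float
```

The squares underflow to zero while the squared mean does not, so the subtraction lands just below zero and the square root leaves the real numbers.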

We should definitely have a check for this and at least return -1 or something similar instead of blindly returning a complex number. I'll make this change now.

Are your validation losses very small or do you think the problem is elsewhere?

fernandocamargoai commented 6 years ago

Hello, @MattPainter01.

I'm not sure that's the case. Have a look at the output printed before the error:

0/100(t): 100%|██████████| 2/2 [00:01<00:00,  1.83it/s, running_loss=0.775, precision=0.012, recall=0.8, loss_std=0.0259, loss=0.749]
0/100(v): 100%|██████████| 1/1 [00:00<00:00,  9.99it/s, val_precision=0.012, val_recall=0.757, val_loss_std=0.000117, val_loss=0.67]
1/100(t): 100%|██████████| 2/2 [00:00<00:00,  3.54it/s, running_loss=0.722, precision=0.0111, recall=0.189, loss_std=0.0363, loss=0.632]
1/100(v): 100%|██████████| 1/1 [00:00<00:00, 16.32it/s, val_precision=0.0108, val_recall=0.0969, val_loss_std=6.22e-05, val_loss=0.507]
2/100(t): 100%|██████████| 2/2 [00:00<00:00,  2.89it/s, running_loss=0.654, precision=0.0101, recall=0.0667, loss_std=0.0483, loss=0.458]
2/100(v): 100%|██████████| 1/1 [00:00<00:00, 18.95it/s, val_precision=0.0133, val_recall=0.0344, val_loss_std=3.43e-21+5.61e-05j, val_loss=0.314]
Traceback (most recent call last):
  File "/home/fernandocamargo/datascience_workspace/recommendation-system/test.py", line 6, in <module>
    task.run()
  File "/home/fernandocamargo/datascience_workspace/recommendation-system/recommendation/task/base.py", line 77, in run
    self.train()
  File "/home/fernandocamargo/datascience_workspace/recommendation-system/recommendation/task/base.py", line 127, in train
    callbacks=self._get_callbacks())
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/torchbearer.py", line 209, in fit_generator
    _callbacks.on_end_epoch(state)
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 281, in on_end_epoch
    self._for_list(lambda callback: callback.on_end_epoch(state))
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 191, in _for_list
    function(callback)
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 281, in <lambda>
    self._for_list(lambda callback: callback.on_end_epoch(state))
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/tensor_board.py", line 97, in on_end_epoch
    self._writer.add_scalar('epoch/' + metric, state[torchbearer.METRICS][metric], state[torchbearer.EPOCH])
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/tensorboardX/writer.py", line 272, in add_scalar
    self.file_writer.add_summary(scalar(tag, scalar_value), global_step)
  File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/tensorboardX/summary.py", line 88, in scalar
    scalar = float(scalar)
TypeError: can't convert complex to float

When the val_loss was 0.507, the val_loss_std was 6.22e-05. But when the val_loss was 0.314, the error happened and the val_loss_std was 3.43e-21+5.61e-05j.
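The logged value is consistent with a square root taken of a slightly negative variance using Python's ** operator. The sketch below assumes a variance of about -(5.61e-05)^2, a figure chosen to match the log rather than one taken from torchbearer.

```python
# Assumed value for illustration: a variance that rounding has pushed below zero.
negative_variance = -(5.61e-05 ** 2)

# In Python 3, a negative float raised to a fractional power returns a complex number.
std = negative_variance ** 0.5

# The imaginary part carries the magnitude; the real part is tiny rounding residue.
real_part, imag_part = std.real, std.imag

# float() refuses complex input, which is the error seen in the traceback.
try:
    float(std)
    conversion_failed = False
except TypeError:
    conversion_failed = True
```

The imaginary part comes out at roughly 5.61e-05 and the real part is on the order of 1e-21, the same shape as the val_loss_std printed above.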

fernandocamargoai commented 6 years ago

My current workaround was to disable the std for the loss by adding this to my code:

@metrics.default_for_key('loss')
@metrics.running_mean
@metrics.mean
class SimpleLossFactory(metrics.MetricFactory):
    def build(self):
        return Loss()

MattPainter01 commented 6 years ago

Does the error still occur when you have more than 1 validation sample?

For a single sample, or for multiple samples with the same value, precision errors when casting from PyTorch floats to Python floats can give us negative variances.

For the moment I'll set it to return a variance of 0 in these situations. Check out the fix/std_complex branch in the meantime, until we merge this.
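A minimal sketch of that kind of guard, clamping a negative variance to 0 before taking the square root. This is not the code on the fix/std_complex branch; the function name safe_std and its running-sum arguments are assumptions made for illustration.

```python
import math


def safe_std(sum_values, sum_squares, count):
    """Std from running sums, treating a negative variance as 0.

    Hypothetical helper; the argument layout is assumed, not taken from torchbearer.
    """
    mean = sum_values / count
    variance = sum_squares / count - mean ** 2
    # Rounding can push the variance slightly below zero; clamp it so the
    # square root stays real.
    return math.sqrt(max(variance, 0.0))


# Two samples of 1e-30 whose squares have underflowed to 0.0 in single precision.
clamped = safe_std(2e-30, 0.0, 2)

# An ordinary case: samples 1.0 and 2.0 (sum 3.0, sum of squares 5.0).
ordinary = safe_std(3.0, 5.0, 2)
```

Clamping keeps the metric a real float, so downstream consumers such as add_scalar() can convert it without error.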

ethanwharris commented 6 years ago

Closed by #296

loss_std resulting in complex number and breaking Tensorboard #290