Closed: fernandocamargoai closed this issue 6 years ago.
@fernandocamargoti Thanks for the issue.
I had a look at the std calculation and I can see that it can go complex if the losses are extremely small; for example, in my testing I get a complex number if I calculate std([1e-30, 1e-30]). PyTorch float tensors default to single precision, so this is probably an underflow error.
We should definitely have a check for this and at least return -1 or something similar instead of blindly returning a complex number. I'll make this change now.
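To illustrate, here is a minimal sketch of the underflow (assuming the running std is computed as E[x^2] - E[x]^2 on Python floats, which is not necessarily the exact torchbearer code):
import torch

losses = torch.tensor([1e-30, 1e-30])    # float32 by default
n = losses.numel()
sum_ = float(losses.sum())
sum_sq = float((losses ** 2).sum())      # 1e-60 underflows to 0.0 in single precision

mean = sum_ / n
var = sum_sq / n - mean ** 2             # comes out slightly negative (about -1e-60)
std = var ** 0.5                         # negative float ** 0.5 gives a complex number in Python 3
print(std)                               # complex, roughly (6.1e-47+1e-30j)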
Are your validation losses very small or do you think the problem is elsewhere?
Hello, @MattPainter01.
I'm not sure that's the case. Here is a look at the prints before the error:
0/100(t): 100%|██████████| 2/2 [00:01<00:00, 1.83it/s, running_loss=0.775, precision=0.012, recall=0.8, loss_std=0.0259, loss=0.749]
0/100(v): 100%|██████████| 1/1 [00:00<00:00, 9.99it/s, val_precision=0.012, val_recall=0.757, val_loss_std=0.000117, val_loss=0.67]
1/100(t): 100%|██████████| 2/2 [00:00<00:00, 3.54it/s, running_loss=0.722, precision=0.0111, recall=0.189, loss_std=0.0363, loss=0.632]
1/100(v): 100%|██████████| 1/1 [00:00<00:00, 16.32it/s, val_precision=0.0108, val_recall=0.0969, val_loss_std=6.22e-05, val_loss=0.507]
2/100(t): 100%|██████████| 2/2 [00:00<00:00, 2.89it/s, running_loss=0.654, precision=0.0101, recall=0.0667, loss_std=0.0483, loss=0.458]
2/100(v): 100%|██████████| 1/1 [00:00<00:00, 18.95it/s, val_precision=0.0133, val_recall=0.0344, val_loss_std=3.43e-21+5.61e-05j, val_loss=0.314]
Traceback (most recent call last):
File "/home/fernandocamargo/datascience_workspace/recommendation-system/test.py", line 6, in <module>
task.run()
File "/home/fernandocamargo/datascience_workspace/recommendation-system/recommendation/task/base.py", line 77, in run
self.train()
File "/home/fernandocamargo/datascience_workspace/recommendation-system/recommendation/task/base.py", line 127, in train
callbacks=self._get_callbacks())
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/torchbearer.py", line 209, in fit_generator
_callbacks.on_end_epoch(state)
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 281, in on_end_epoch
self._for_list(lambda callback: callback.on_end_epoch(state))
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 191, in _for_list
function(callback)
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/callbacks.py", line 281, in <lambda>
self._for_list(lambda callback: callback.on_end_epoch(state))
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/torchbearer/callbacks/tensor_board.py", line 97, in on_end_epoch
self._writer.add_scalar('epoch/' + metric, state[torchbearer.METRICS][metric], state[torchbearer.EPOCH])
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/tensorboardX/writer.py", line 272, in add_scalar
self.file_writer.add_summary(scalar(tag, scalar_value), global_step)
File "/home/fernandocamargo/anaconda3/envs/recommendation-system/lib/python3.6/site-packages/tensorboardX/summary.py", line 88, in scalar
scalar = float(scalar)
TypeError: can't convert complex to float
When the val_loss was 0.507, the val_loss_std was 6.22e-05. But when the val_loss was 0.314, the error happened and the val_loss_std was 3.43e-21+5.61e-05j.
My current workaround was to disable the std for the loss by adding this to my code:
from torchbearer import metrics
from torchbearer.metrics.primitives import Loss  # assumption: the primitive loss metric

# Same as the default loss metric but without @metrics.std, so no
# loss_std/val_loss_std is computed.
@metrics.default_for_key('loss')
@metrics.running_mean
@metrics.mean
class SimpleLossFactory(metrics.MetricFactory):
    def build(self):
        return Loss()
Does the error still occur when you have more than 1 validation sample?
For single samples, or multiple samples with the same value, precision errors when casting from PyTorch floats to Python floats can give us negative variances.
For the moment I'll set it to return a variance of 0 in these situations; check out the fix/std_complex branch in the meantime until we merge this.
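The kind of guard I mean looks something like this (just a sketch of the idea, not the exact code on that branch):
def running_std(sum_, sum_sq, count):
    # Variance via E[x^2] - E[x]^2; clamp tiny negative values caused by
    # floating point error before taking the square root.
    mean = sum_ / count
    var = sum_sq / count - mean ** 2
    if var < 0:
        var = 0.0
    return var ** 0.5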
Closed by #296
I'm using torchbearer with PyTorch 0.4 and TensorboardX 1.2. Previously, I was using PyTorch 0.4.1, but I had to downgrade to use TensorboardX because of an incompatibility between them. After adding the Tensorboard callback, the following error is raised after training for some time:
TypeError: can't convert complex to float
When debugging, I noticed that TensorboardX's add_scalar() tried to convert the scalar to float and, somehow, the val_loss_std was a complex number. Is there an error in how the std is calculated that results in a complex number?
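For reference, this is the conversion that fails; Python's float() refuses complex values outright:
val_loss_std = 3.43e-21 + 5.61e-05j  # the value I saw in the metrics dict
float(val_loss_std)                  # TypeError: can't convert complex to float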