mapbox / robosat

Semantic segmentation on aerial and satellite imagery. Extracts features such as: buildings, parking lots, roads, water, clouds
MIT License

Division by zero error in training related to number of samples #124

Closed wboykinm closed 6 years ago

wboykinm commented 6 years ago

Getting

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/app/robosat/tools/__main__.py", line 60, in <module>
    args.func(args)
  File "/app/robosat/tools/train.py", line 147, in main
    val_hist = validate(val_loader, num_classes, device, net, criterion)
  File "/usr/lib/python3.5/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/app/robosat/tools/train.py", line 245, in validate
    "mcc": metrics.get_mcc(),
  File "/app/robosat/metrics.py", line 66, in get_mcc
    (self.tp + self.fp) * (self.tp + self.fn) * (self.tn + self.fp) * (self.tn + self.fn)
ZeroDivisionError: float division by zero

. . . following the completion of a single epoch (out of a total of 1) in this workflow. It's unclear from the traceback where the zero comes from, or how it could be fixed in either the config .toml files or the inputs. It seems related to this division, but it's not clear how a case with num_samples = 0 could get past the validity check just above it.

cc @jacquestardie

daniel-j-h commented 6 years ago

Looks like the error is coming from

https://github.com/mapbox/robosat/blob/036e2aef336acec033e23ffed4324ff0997e5d94/robosat/metrics.py#L59-L67

It can happen if tp+fp=0, tp+fn=0, tn+fp=0, or tn+fn=0.

In this case we probably should just return float("nan") - what do you think @ocourtin?
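To illustrate the idea, here is a minimal sketch of such a guard. It is not the actual robosat code: `safe_mcc` is a hypothetical standalone helper, and it assumes the confusion-matrix counts are plain numbers like the `self.tp`, `self.fp`, `self.fn`, `self.tn` fields seen in the traceback. It uses the standard MCC definition, whose denominator is zero whenever any of the four sums listed above is zero.

```python
import math


def safe_mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient that returns NaN instead of raising.

    Hypothetical helper for illustration only; not the robosat implementation.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # Any of tp+fp, tp+fn, tn+fp, tn+fn being zero makes the MCC undefined.
    if denominator == 0:
        return float("nan")

    return (tp * tn - fp * fn) / denominator


# The model never predicts the positive class, so tp + fp == 0.
print(safe_mcc(tp=0.0, fp=0.0, fn=5.0, tn=95.0))

# A well-defined case for comparison.
print(safe_mcc(tp=40.0, fp=10.0, fn=10.0, tn=40.0))
```

Returning NaN rather than 0.0 keeps an undefined score distinguishable from a genuinely uncorrelated one, but anything that aggregates or logs the metric then has to tolerate NaN values.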

ocourtin commented 6 years ago

@daniel-j-h

Indeed the metrics should be more robust to division by zero...

I've just created a new branch with a related fix: https://github.com/ocourtin/robosat/tree/nan https://github.com/mapbox/robosat/compare/master...ocourtin:nan

@wboykinm Could you give it a try?

wboykinm commented 6 years ago

@ocourtin That seems to have done the trick on the division by zero error! Thanks for adding the handler!

(Of course I'm on to newer and bigger failures, but that doesn't appear related and I'll try some debugging before I whine about that one.)

ocourtin commented 6 years ago

@daniel-j-h New related PR: https://github.com/mapbox/robosat/pull/127

@wboykinm Thanks for the test and report. I've replied on the 'newer and bigger' one...

daniel-j-h commented 6 years ago

#127 resolves this issue - thanks @ocourtin! Let's keep the discussion for the worker problem in https://github.com/mapbox/robosat/issues/126.