nlpodyssey / spago

Self-contained Machine Learning and Natural Language Processing library in Go
BSD 2-Clause "Simplified" License

v1.0.0-alpha.0: operator gradients are sometimes null in non-obvious ways #111

Closed · jjviana closed 2 years ago

jjviana commented 2 years ago
  1. Download the imdb sentiments dataset from here: https://drive.google.com/drive/folders/1PZUyks3g1rvoSR9ZSaWr7hy61NQifcoq?usp=sharing
  2. Try to train a model with the following parameters: ./perceiver train -i train.shuf.csv --test-file validation.csv -j 32 -s 21 -k 2 -g 1 -e 1 -o imdb-minimal-2l.model -n 1

At some random point in training, the following error occurs:

```
goroutine 4813637 [running]:
github.com/nlpodyssey/spago/mat.(*Dense[...]).Rows()
	:1 +0x9
github.com/nlpodyssey/spago/mat.SameDims[...]({0x1584880?, 0x0}, {0x1584880, 0xc00b16d950})
	/Users/julianoviana/Development/spago/mat/matrix.go:231 +0x38
github.com/nlpodyssey/spago/mat.(*Dense[...]).AddInPlace(0x0, {0x1584880, 0xc00b16d950})
	/Users/julianoviana/Development/spago/mat/dense.go:514 +0x5f
github.com/nlpodyssey/spago/ag.(*Operator[...]).AccGrad(0xc0588a3b00, {0x1584880, 0xc00b16d950})
	/Users/julianoviana/Development/spago/ag/operator.go:165 +0x122
github.com/nlpodyssey/spago/ag/fn.(*Mul[...]).Backward.func2()
	/Users/julianoviana/Development/spago/ag/fn/mul.go:64 +0x18d
created by github.com/nlpodyssey/spago/ag/fn.(*Mul[...]).Backward
	/Users/julianoviana/Development/spago/ag/fn/mul.go:58 +0x2a5
```

Debugging the problem, I found that in `operator.go` it is possible to observe `o.grad != nil` and, at the same time, `reflect.ValueOf(o.grad).IsNil() == true`. That means somewhere a nil pointer is being converted to `mat.Matrix[T]` and stored in `o.grad`. Since `mat.Matrix` is an interface, the `== nil` test returns false, but any method call panics, as `mat.Dense` does not consider the possibility of a nil pointer receiver.

The following seems to fix it, albeit with the use of reflection:

```diff
diff --git a/ag/operator.go b/ag/operator.go
index d0167911..47c39738 100644
--- a/ag/operator.go
+++ b/ag/operator.go
@@ -157,11 +157,12 @@ func (o *Operator[T]) AccGrad(grad mat.Matrix[T]) {
 	o.cond.L.Lock()
 	defer o.cond.L.Unlock()
-	if o.grad == nil {
+	if o.grad == nil || reflect.ValueOf(o.grad).IsNil() {
 		o.cond.L.Unlock()
 		o.grad = o.Value().ZerosLike()
 		o.cond.L.Lock()
 	}
+
 	o.grad.AddInPlace(grad)
 	if o.inBackward && atomic.AddInt64(&o.pendingGrads, -1) == 0 {
```

Not sure this is an acceptable fix - but I couldn't find any place where this weird grad is being set either...
:1 +0x9 github.com/nlpodyssey/spago/mat.SameDims[...]({0x1584880?, 0x0}, {0x1584880, 0xc00b16d950}) /Users/julianoviana/Development/spago/mat/matrix.go:231 +0x38 github.com/nlpodyssey/spago/mat.(*Dense[...]).AddInPlace(0x0, {0x1584880, 0xc00b16d950}) /Users/julianoviana/Development/spago/mat/dense.go:514 +0x5f github.com/nlpodyssey/spago/ag.(*Operator[...]).AccGrad(0xc0588a3b00, {0x1584880, 0xc00b16d950}) /Users/julianoviana/Development/spago/ag/operator.go:165 +0x122 github.com/nlpodyssey/spago/ag/fn.(*Mul[...]).Backward.func2() /Users/julianoviana/Development/spago/ag/fn/mul.go:64 +0x18d created by github.com/nlpodyssey/spago/ag/fn.(*Mul[...]).Backward /Users/julianoviana/Development/spago/ag/fn/mul.go:58 +0x2a5 Debugging the problem I have found that in operator.go it is possible to observe o.grad!=nil and at the same time reflect.ValueOf(o.grad).IsNil()==true. That means somewhere a nil pointer is being cast to mat.Matrix[t] and stored in o.grad. Since mat.Matrix is an interface, the ==nil test will return false but any method call will panic as mat.Dense does not consider the possibility of a nil pointer value. The following seems to fix it, albeit with the use of reflection: ``` diff --git a/ag/operator.go b/ag/operator.go index d0167911..47c39738 100644 --- a/ag/operator.go +++ b/ag/operator.go @@ -157,11 +157,12 @@ func (o *Operator[T]) AccGrad(grad mat.Matrix[T]) { o.cond.L.Lock() defer o.cond.L.Unlock() - if o.grad == nil { + if o.grad == nil || reflect.ValueOf(o.grad).IsNil() { o.cond.L.Unlock() o.grad = o.Value().ZerosLike() o.cond.L.Lock() } + o.grad.AddInPlace(grad) if o.inBackward && atomic.AddInt64(&o.pendingGrads, -1) == 0 { ``` Not sure this is an acceptable fix - but couldn't find any place where this weird grad is being set either...