nlpodyssey / cybertron

Cybertron: the home planet of the Transformers in Go
BSD 2-Clause "Simplified" License
280 stars 26 forks source link

data race detected by `go run -race` in textencoding #11

Closed JackKCWong closed 1 year ago

JackKCWong commented 1 year ago

When I do a simple inference, I found a data race by using go run -race .

The message is pretty long but basically all look like the below:

==================
WARNING: DATA RACE
Write at 0x00c004de27a8 by goroutine 127:
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:25 +0x1e4
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/spago/ag.ReleaseGraph()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/release.go:21 +0x188
  github.com/nlpodyssey/cybertron/pkg/tasks/textencoding/bert.(*TextEncoding).Encode.func1.1()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.2/pkg/tasks/textencoding/bert/textencoding.go:78 +0x64

Previous read at 0x00c004de27a8 by goroutine 128:
  github.com/nlpodyssey/spago/ag.(*Operator).forward()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/operator.go:194 +0xc7
  github.com/nlpodyssey/spago/ag.NewOperator.func1()
      /home/lighthouse/go/pkg/mod/github.com/nlpodyssey/spago@v1.0.1/ag/operator.go:58 +0x39
matteo-grella commented 1 year ago

Does it give you the same using version v0.1.3-0.20230219111654-ef2ca134a6d3 in your project?

JackKCWong commented 1 year ago

Yes it still happens on HEAD. I can see an issue here: https://github.com/nlpodyssey/cybertron/blob/main/pkg/tasks/textencoding/bert/textencoding.go#L78 since ag.ReleaseGraph says it's not concurrency safe. If I remove those lines the data race goes away but I suppose that way introduces memory leak.

matteo-grella commented 1 year ago

I see, thanks for the heads up!

Let us think, I've never been convinced how we handle memory in the computational graph.

@jjviana @marco-nicola @edobrambilla @evanmcclure

acim commented 1 year ago

This also happens in bert/tokenclassification.go at line 107 of the main branch. If I add:

            op.mx.Lock()
            op.cond.L = nil
            op.mx.Unlock()

to the RealeaseGraph function (in spago), data race is gone but since you have this comment:

    // mx is the mutex used by cond.
    // It's defined here in order to avoid an extra memory allocation
    // (from NewOperator), but it's never be used directly.

I am not sure this is the right solution. However, it would be nice to have this function concurrency safe and get rid of the data races.

acim commented 1 year ago
            op.cond.L.Lock()
            op.cond.L = nil

This also works, without unlock since the mutex has gone to nil, but again I am not sure this doesn't make some other trouble.

matteo-grella commented 1 year ago

@JackKCWong and @acim, kindly check the latest main branch to confirm if the data races during inference have been resolved. Please note that the serialized model is not backwards compatible, and as such, you will need to reconvert the model by removing the repo folder within the model folder as well as the spago_model.bin file.

matteo-grella commented 1 year ago

@JackKCWong please feel free to close the issue is the problem is fixed

acim commented 1 year ago

Works fine now. It seems that you have resolved some other problem as well because before this Kubernetes used to kill the container with cybertron for SystemOOM reason but this doesn't happen anymore. I didn't figure out was it memory leaking or just segmentation fault, didn't see it in the logs but anyways this look much better now. Thank you.

acim commented 1 year ago

Actually, I just tried it again now, using docker-compose, I deleted the models volume first and basically started it from scratch. I got this data race:

{"level":"debug","url":https://huggingface.co/dslim/bert-base-NER/resolve/main/pytorch_model.bin,"destination":"/models/dslim/bert-base-NER/pytorch_model.bin","time":"2023-04-27T09:33:07Z","message":"downloading"}

==================

WARNING: DATA RACE

Read at 0x00c00108d9b8 by goroutine 59:

  github.com/nlpodyssey/cybertron/pkg/downloader.(*downloadProgress).reportProgress()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:65](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:65) +0x65

 github.com/nlpodyssey/cybertron/pkg/downloader.(*downloadProgress).goRoutine()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:58](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:58) +0x70

  github.com/nlpodyssey/cybertron/pkg/downloader.(*downloadProgress).Start.func1()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:37](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:37) +0x39

intelligence-center           |

Previous write at 0x00c00108d9b8 by main goroutine:

  github.com/nlpodyssey/cybertron/pkg/downloader.(*downloadProgress).Write()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:82](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:82) +0x64

  io.(*teeReader).Read()

      /usr/local/go/src/io/io.go:615 +0xbe

  io.copyBuffer()

      /usr/local/go/src/io/io.go:427 +0x28d

  io.Copy()

      /usr/local/go/src/io/io.go:386 +0x88

  os.genericReadFrom()

      /usr/local/go/src/os/file.go:161 +0x34

  os.(*File).ReadFrom()

      /usr/local/go/src/os/file.go:155 +0x324

  io.copyBuffer()

      /usr/local/go/src/io/io.go:413 +0x1c5

  io.Copy()

      /usr/local/go/src/io/io.go:386 +0x984

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.downloadFile()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:153](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:153) +0x87c

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.downloadModelSpecificFiles()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:108](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:108) +0x23a

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:88](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:88) +0x294

  github.com/nlpodyssey/cybertron/pkg/downloader.Download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:60](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:60) +0x1ae

  github.com/nlpodyssey/cybertron/pkg/tasks.loader[...].download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:257](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:257) +0xad

  github.com/nlpodyssey/cybertron/pkg/tasks.loader[...].load()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:88](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:88) +0x2a4

  github.com/nlpodyssey/cybertron/pkg/tasks.Load[...]()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:44](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:44) +0x1c8

  REDACTED/internal/chat.NewNER()

      /app/internal/chat/ner.go:28 +0x3e

  main.mustCreateServerAPI()

      /app/server_api.go:70 +0xd09

  main.main()

      /app/main.go:138 +0x784

intelligence-center           |

Goroutine 59 (running) created at:

  github.com/nlpodyssey/cybertron/pkg/downloader.(*downloadProgress).Start()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:37](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadprogress.go:37) +0xfa

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.downloadFile()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:150](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:150) +0x831

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.downloadModelSpecificFiles()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:108](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:108) +0x23a

  github.com/nlpodyssey/cybertron/pkg/downloader.downloader.download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:88](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:88) +0x294

  github.com/nlpodyssey/cybertron/pkg/downloader.Download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:60](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/downloader/downloadmodel.go:60) +0x1ae

  github.com/nlpodyssey/cybertron/pkg/tasks.loader[...].download()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:257](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:257) +0xad

  github.com/nlpodyssey/cybertron/pkg/tasks.loader[...].load()

      [/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:88](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:88) +0x2a4

  github.com/nlpodyssey/cybertron/pkg/tasks.Load[...]()

     [ /go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:44](mailto:/go/pkg/mod/github.com/nlpodyssey/cybertron@v0.1.3-0.20230425155428-a934f3ba723b/pkg/tasks/loader.go:44) +0x1c8

  REDACTED/internal/chat.NewNER()

      /app/internal/chat/ner.go:28 +0x3e

  main.mustCreateServerAPI()

      /app/server_api.go:70 +0xd09

  main.main()

      /app/main.go:138 +0x784

Downloading other files from this model was fine, it's just with this one.

JackKCWong commented 1 year ago

@JackKCWong please feel free to close the issue is the problem is fixed

I am out for a week on holiday, will have a try as soon as I am back.

JackKCWong commented 1 year ago

I can confirm that the race condition in textencoding is fixed at v0.1.3-0.20230429163120-6369de12618c

Thanks!