sdobber / FluxArchitectures.jl

Complex neural network examples for Flux.jl
MIT License

Issues training models on FluxBench #35

Closed: DhairyaLGandhi closed this issue 2 years ago

DhairyaLGandhi commented 2 years ago

Ref https://buildkite.com/julialang/fluxbench-dot-jl/builds/84#627a51df-aa8d-4357-963d-877ff8fdef41

Some models seem to fail. This is likely due to changes made to Zygote. Is it possible to do a bisect to find out which version of Zygote broke differentiability of certain models? We found this in Metalhead as well, but I would love to have a good idea of what caused it. It likely lines up with when Zygote started wrapping ChainRules' outputs.
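A rough sketch of how such a bisect could be organised, assuming the candidate releases are sorted oldest to newest and that every release before the breaking one works while every later one fails. The name `first_bad` and the `works` predicate are hypothetical; in a real run the predicate would install the given release (for example with `Pkg.add(name="Zygote", version=string(v))`) and try the failing gradient in a fresh Julia process. Here it is replaced by a stub with a made-up boundary so that the search logic can be run on its own.

```julia
# Find the first release for which `works(v)` returns false.
# Assumes `versions` is sorted oldest to newest and that the failure,
# once introduced, persists in all later releases.
function first_bad(works, versions::AbstractVector)
    lo, hi = 1, length(versions)
    if !works(versions[lo]) || works(versions[hi])
        error("need a working oldest release and a failing newest release")
    end
    # Invariant: versions[lo] works, versions[hi] fails.
    while hi - lo > 1
        mid = (lo + hi) ÷ 2
        if works(versions[mid])
            lo = mid
        else
            hi = mid
        end
    end
    return versions[hi]
end

# Illustrative candidates only, not a claim about which releases are affected.
candidates = [v"0.6.20", v"0.6.21", v"0.6.22", v"0.6.23", v"0.6.24"]

# Stub predicate with an arbitrary boundary, standing in for
# "install this release and check that the model still differentiates".
works_stub(v) = v < v"0.6.22"

culprit = first_bad(works_stub, candidates)
```

With n candidate releases this needs about log2(n) install-and-test rounds on top of the two endpoint checks, which matters when each round means re-resolving an environment.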

sdobber commented 2 years ago

I'll have a look in the next few days. Reminds me a bit of #22, where going from Zygote 0.6.21 to 0.6.22 gave ChainRulesCore.NoTangent errors.

DhairyaLGandhi commented 2 years ago

Okay, so I am running with 0.6.22 now as well. This needs an MWE so we can inspect it in Zygote.

sdobber commented 2 years ago

The old issue was fixed in a later version of Zygote. The example code to cause the error is still available here. IIRC, it was hard to get it down to a "minimal" example - those seemed to work fine. There is also an old discussion in SliceMaps.jl.

sdobber commented 2 years ago

Are you sure that this is related to Zygote? I have two branches in my repo, faonly and faonly_up3.

I updated my faonly branch to use the latest FluxArchitectures and Zygote 0.6.33, and the CPU benchmarks run fine. When I use the faonly_up3 branch, however, I get the same error that you saw, with the same version of Zygote.

It might be worth having a look at the diff, especially the Manifest. I can see, for example, that the versions of CUDA and ChainRules differ. Do you think these are the most likely candidates for the error?
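To narrow down the Manifest diff, something like the following could list only the packages whose resolved version differs between two manifests. It is a minimal sketch using the TOML standard library; `manifest_versions` and `changed_versions` are hypothetical helper names, and the inline manifests with their version numbers are made up for illustration. For the real branches one would load each file with `TOML.parsefile` instead.

```julia
using TOML

# Collect name => version from a parsed Manifest.
# Newer manifests nest package entries under "deps";
# older ones list the packages at the top level.
function manifest_versions(manifest::AbstractDict)
    deps = get(manifest, "deps", manifest)
    versions = Dict{String,String}()
    for (name, entries) in deps
        entries isa AbstractVector || continue
        isempty(entries) && continue
        # Standard libraries may carry no "version" field.
        versions[name] = get(first(entries), "version", "stdlib")
    end
    return versions
end

# Map each package whose version differs to (version in a, version in b).
function changed_versions(a::AbstractDict, b::AbstractDict)
    va, vb = manifest_versions(a), manifest_versions(b)
    changed = Dict{String,Tuple{String,String}}()
    for name in union(keys(va), keys(vb))
        old = get(va, name, "absent")
        new = get(vb, name, "absent")
        old == new || (changed[name] = (old, new))
    end
    return changed
end

# Made-up manifests standing in for the two branches.
manifest_a = TOML.parse("""
[[deps.Zygote]]
version = "0.6.33"

[[deps.ChainRules]]
version = "1.0.0"
""")

manifest_b = TOML.parse("""
[[deps.Zygote]]
version = "0.6.33"

[[deps.ChainRules]]
version = "1.1.0"

[[deps.CUDA]]
version = "3.0.0"
""")

changes = changed_versions(manifest_a, manifest_b)
```

In this toy input Zygote is identical on both sides, so only ChainRules and CUDA show up in `changes`. That is the shape of result that would point away from Zygote itself and towards its dependencies.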

DhairyaLGandhi commented 2 years ago

It had to do with ChainRules breaking code downstream. The latest FluxBench should be working and has the FA benchmarks in there as well. Thanks for the help!